Abstract: In high dimensions, many variable selection methods, such as the lasso, are often limited by excessive variability and rank deficiency of the sample covariance matrix. Covariance sparsity is a natural phenomenon in high-dimensional applications, such as microarray analysis and image processing, in which a large number of predictors are independent or weakly correlated. In this paper, we propose the covariance-thresholded lasso, a new class of regression methods that can utilize covariance sparsity to improve variable selection. We establish theoretical results, under the random design setting, that relate covariance sparsity to variable selection. Real-data and simulation examples indicate that our method can be useful in improving variable selection performance.

1. Introduction

Variable selection in high-dimensional regression is a central problem in statistics and has stimulated much interest in the past few years. Motivation for developing effective variable selection methods in high dimensions comes from a variety of applications, such as gene microarray analysis and image processing, where it is necessary to identify a parsimonious subset of predictors to improve interpretability and prediction accuracy. In this paper, we consider the following linear model for $X = (X_1, X_2, \ldots, X_p)^T$, a vector of $p$ predictors, and $Y$, a response variable,

$$Y = X\beta^* + \epsilon, \qquad (1.1)$$

where $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T$ is a vector of regression coefficients and $\epsilon$ is a normal random error with mean 0 and variance $\sigma^2$. If $\beta_j^*$ is nonzero, then $X_j$ is said to be a true variable; otherwise, it is an irrelevant variable. Further, when only a few coefficients $\beta_j^*$ are believed to be nonzero, we refer to (1.1) as a sparse linear model. The purpose of variable selection is to separate the true variables from the irrelevant ones based upon some observations of the model. In many applications, $p$ can be fairly large or even larger than the sample size $n$. The problem of large $p$ and small $n$ presents a fundamental challenge for variable selection.

Recently, various methods based upon $L_1$ penalized least squares have been proposed for variable selection. The lasso, introduced by Tibshirani (1996), is the forerunner and foundation for many of these methods. Suppose that $y$ is an $n \times 1$ vector of observed responses centered to have mean 0 and $X = (X_1, X_2, \ldots, X_p)$ is an $n \times p$ data matrix with each column $X_j$ standardized to have mean zero and variance 1. We may reformulate the lasso as follows,

$$\hat\beta^{Lasso}(\lambda_n) = \arg\min_{\beta} \left\{ \beta^T \hat\Sigma \beta - 2\beta^T \left( \frac{1}{n} X^T y \right) + 2\lambda_n \|\beta\|_1 \right\}, \qquad (1.2)$$

where $\hat\Sigma = X^T X / n$ is the sample covariance or correlation matrix. Consistency in variable selection for the lasso has been proved under the neighborhood stability condition in Meinshausen and Bühlmann (2006) and under the irrepresentable condition in Zhao and Yu (2006). Compared with traditional variable selection procedures, such as all subset selection, AIC, and BIC, the lasso has continuous solution paths and can be computed efficiently using innovative algorithms, such as the LARS in Efron, Hastie, Johnstone, and Tibshirani (2004). Since its introduction, the lasso has emerged as one of the most widely used methods for variable selection.

In the lasso literature, the data matrix $X$ is often assumed to be fixed. However, this assumption may not be realistic in high-dimensional applications, where data usually come from observational rather than experimental studies. In this paper, we assume the predictors $X_1, X_2, \ldots, X_p$ in (1.1) to be random with $E(X) = 0$ and $E(XX^T) = \Sigma = (\sigma_{ij})_{1 \le i \le p,\, 1 \le j \le p}$. In addition, we assume that the population covariance matrix $\Sigma$ is sparse in the sense that the proportion of nonzero $\sigma_{ij}$ in $\Sigma$ is relatively small. Motivations for studying sparse covariance matrices come from a myriad of applications in high dimensions, where a large number of predictors can be independent or weakly correlated with each other. For example, in gene microarray analysis, it is often reasonable to assume that genes belonging to different pathways or systems are independent or weakly correlated (Rothman, Levina, and Zhu 2009; Wagaman and Levina 2008). In these applications, the number of nonzero covariances in $\Sigma$ can be much smaller than $p(p-1)/2$, the total number of covariances.

An important component of lasso regression (1.2) is the sample covariance matrix $\hat\Sigma$. We note that the sample covariance matrix is rank-deficient when $p > n$. This can cause the lasso to saturate after at most $n$ variables are selected. Moreover, the 'large $p$ and small $n$' scenario can cause excessive variability of sample covariances between the true and irrelevant variables. This deteriorates the ability of the lasso to separate true variables from irrelevant ones. More specifically, a sufficient and almost necessary condition for the lasso to be variable selection consistent is derived in Zhao and Yu (2006), which they call the irrepresentable condition. It constrains the inter-connectivity between the true and irrelevant variables in the following way. Let $S = \{j \in \{1, \ldots, p\} \mid \beta_j^* \neq 0\}$ and $C = \{1, 2, \ldots, p\} - S$, such that $S$ is the collection of true variables and $C$ is the complement of $S$ that is composed of the irrelevant variables. Assume that the cardinality of $S$ is $s$; in other words, there are $s$ true variables and $p - s$ irrelevant ones. Further, let $X_S$ and $X_C$ be sub-data matrices of $X$ that contain the observations of the true and irrelevant variables, respectively. Define $\hat{\mathcal{I}} = |\hat\Sigma_{CS}(\hat\Sigma_{SS})^{-1}\,\mathrm{sgn}(\beta_S^*)|$, where $\hat\Sigma_{CS} = X_C^T X_S / n$ and $\hat\Sigma_{SS} = X_S^T X_S / n$. We refer to $\hat{\mathcal{I}}$ as the sample irrepresentable index. It can be interpreted as representing the amount of inter-connectivity between the true and irrelevant variables. In order for the lasso to select the true variables consistently, the irrepresentable condition requires $\hat{\mathcal{I}}$ to be bounded from above, that is, $\hat{\mathcal{I}} < 1 - \epsilon$ for some $\epsilon \in (0, 1)$, entry-wise. Clearly, excessive variability of the sample covariance matrix induced by large $p$ and small $n$ can cause $\hat{\mathcal{I}}$ to exhibit large variation that makes the irrepresentable condition less likely to hold. These inadequacies motivate us to consider alternatives to the sample covariance matrix to improve variable selection for the lasso in high dimensions.
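As an illustration, the sample irrepresentable index can be computed directly from its definition. The following is a minimal sketch in Python with NumPy; the function name `sample_irrepresentable_index` is ours and hypothetical, and the sketch assumes that the sample covariance of the true variables is invertible (that is, $s \le n$ and no exact collinearity among the true variables).

```python
import numpy as np

def sample_irrepresentable_index(X, beta_star):
    """Entry-wise |Sigma_CS Sigma_SS^{-1} sgn(beta*_S)|, one value per irrelevant variable."""
    X = np.asarray(X, dtype=float)
    beta_star = np.asarray(beta_star, dtype=float)
    n = X.shape[0]
    S = np.flatnonzero(beta_star != 0)   # indices of the true variables
    C = np.flatnonzero(beta_star == 0)   # indices of the irrelevant variables
    Sigma_SS = X[:, S].T @ X[:, S] / n
    Sigma_CS = X[:, C].T @ X[:, S] / n
    # Solve Sigma_SS w = sgn(beta*_S) rather than forming the inverse explicitly.
    w = np.linalg.solve(Sigma_SS, np.sign(beta_star[S]))
    return np.abs(Sigma_CS @ w)

# Toy design: columns 0 and 1 are true variables, column 2 is orthogonal to
# both, and column 3 duplicates column 0.
c1 = np.array([1.0, -1.0, 1.0, -1.0])
c2 = np.array([1.0, 1.0, -1.0, -1.0])
c3 = np.array([1.0, -1.0, -1.0, 1.0])
X_toy = np.column_stack([c1, c2, c3, c1])
index_toy = sample_irrepresentable_index(X_toy, [2.0, -3.0, 0.0, 0.0])
```

In this toy design the index is 0 for the orthogonal irrelevant variable and 1 for the duplicated one, so the condition $\hat{\mathcal{I}} < 1 - \epsilon$ fails for the latter.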

Next, we provide some insight on how the sparsity of the population covariance matrix can influence variable selection for the lasso. Under the random design assumption on $X$, the inter-connectivity between the true and irrelevant variables can be stated in terms of their population variances and covariances. Let $\Sigma_{CS}$ be the covariance matrix between the irrelevant variables and true variables and $\Sigma_{SS}$ the variance-covariance matrix of the true variables. We define the population irrepresentable index as $\mathcal{I} = |\Sigma_{CS}\Sigma_{SS}^{-1}\,\mathrm{sgn}(\beta_S^*)|$. Intuitively, the sparser the population covariances $\Sigma_{CS}$ and $\Sigma_{SS}$ are, or the sparser $\Sigma$ is, the more likely that $\mathcal{I} < 1 - \epsilon$, entry-wise. This property, however, does not automatically trickle down to the sample irrepresentable index $\hat{\mathcal{I}}$, due to its excessive variability. When $\Sigma_{CS}$ and $\Sigma_{SS}$ are known a priori to be sparse and $\mathcal{I} < 1 - \epsilon$, entry-wise, some regularization on the covariance can be used to reduce the variabilities of $\hat\Sigma$ and $\hat{\mathcal{I}}$ and allow the irrepresentable condition to hold more easily for $\hat{\mathcal{I}}$. Furthermore, the sample covariance matrix $\hat\Sigma = X^T X / n$ is obviously non-sparse, and imposing sparsity on $\hat\Sigma$ has the benefit of sometimes increasing the rank of the sample covariance matrix.

We use an example to demonstrate how rank deficiency and excessive variability of the sample covariance matrix $\hat\Sigma$ can compromise the performance of the lasso for large $p$ and small $n$. Suppose there are 40 variables ($p = 40$) and $\Sigma = I_p$ ($I_p$ is the $p \times p$ identity matrix). Since all variables are independent of each other, the population irrepresentable index clearly satisfies $\mathcal{I} < 1 - \epsilon$, entry-wise. Further, we let $\beta_j^* = 2$ for $1 \le j \le 10$, and $\beta_j^* = 0$ for $11 \le j \le 40$. The error standard deviation $\sigma$ is set to be about 6.3 to have a signal-to-noise ratio of approximately 1. The lasso, in general, does not take into consideration the structural properties of the model, such as the sparsity or the orthogonality of $\Sigma$ in this example. One way to take advantage of the orthogonality of $\Sigma$ is to replace $\hat\Sigma$ in (1.2) by $I_p$, which leads to the univariate soft thresholding (UST) estimates $\hat\beta_j^{UST} = \mathrm{sgn}(r_j)(|r_j| - \lambda)_+$, where $r_j = X_j^T y / n$ for $1 \le j \le p$. We compare the performances of the lasso and UST over various sample sizes ($5 \le n \le 250$) using the variable selection measure $G$. $G$ is defined as the geometric mean between sensitivity, (no. of true positives)$/s$, and specificity, $1 -$ (no. of false positives)$/(p - s)$ (Tibshirani, Saunders, Rosset, Zhu, and Knight 2005; Chong and Jun 2005; Kubat, Holte, and Matwin 1998). $G$ varies between 0 and 1. Larger $G$ indicates better selection with a larger proportion of variables classified correctly.
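The UST estimate and the measure $G$ are both simple to compute. The following minimal sketch in Python with NumPy is offered only as an illustration; the function names `ust_estimate` and `g_measure` are hypothetical, and `g_measure` assumes $0 < s < p$ so that both ratios are defined.

```python
import numpy as np

def ust_estimate(X, y, lam):
    """Univariate soft thresholding: sgn(r_j) * (|r_j| - lam)_+ with r_j = X_j^T y / n."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = X.T @ y / X.shape[0]
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def g_measure(beta_hat, beta_star):
    """Geometric mean of sensitivity and specificity of the selected support."""
    selected = np.asarray(beta_hat) != 0
    truth = np.asarray(beta_star) != 0
    s, p = truth.sum(), truth.size
    sensitivity = (selected & truth).sum() / s
    specificity = 1.0 - (selected & ~truth).sum() / (p - s)
    return float(np.sqrt(sensitivity * specificity))

# Toy orthogonal design with X^T X / n = I, so r equals the true coefficients
# when there is no noise.
H = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0],
              [1.0, 1.0, -1.0, -1.0],
              [1.0, -1.0, -1.0, 1.0]])
beta_true = np.array([2.0, 0.0, 0.0, 0.0])
beta_ust = ust_estimate(H, H @ beta_true, lam=0.5)
g_toy = g_measure(beta_ust, beta_true)
```

In this noiseless toy case the UST estimate shrinks the single true coefficient by $\lambda$ and selects exactly the true support, so $G = 1$.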

Figure 1 plots the median $G$ based on 200 replications for the lasso and UST against sample sizes. For each replication, $\lambda$ is determined ex post facto by the optimal $G$ in order to avoid stochastic errors from tuning parameter estimation, such as by using cross-validation. It is clear from Figure 1 that, when $n > 20$, the lasso slightly outperforms UST; when $n < 20$, the performance of the lasso starts to deteriorate precipitously, whereas the performance of UST declines at a much slower pace and starts to outperform the lasso. This example suggests that when $p$ is large and $n$ is relatively small, sparsity of $\Sigma$ can be used to enhance variable selection.

Figure 1: Median $G$ (from $n = 5$ to $n = 250$) for the illustrating example based upon 200 replications.




The discussions above motivate us to consider improving the performance of the lasso by applying regularization to the sample covariance matrix $\hat\Sigma$. A good sparse covariance-regularizing operator on $\hat\Sigma$ should satisfy the following properties:

1. The operator stabilizes $\hat\Sigma$.

2. The operator can increase the rank of $\hat\Sigma$.

3. The operator utilizes the underlying sparsity of the covariance matrix.

The first and second properties are obviously useful and have been explored in the literature. For example, the elastic net, introduced in Zou and Hastie (2005), replaces $\hat\Sigma$ by $\hat\Sigma_{EN} = (\hat\Sigma + \lambda_2 I)/(1 + \lambda_2)$ in (1.2), where $\lambda_2 \ge 0$ is a tuning parameter. $\hat\Sigma_{EN}$ can be more stable and have higher rank than $\hat\Sigma$ but is non-sparse. Nonetheless, in many applications, utilizing the underlying sparsity may be more crucial in improving the lasso when data are scarce, such as under the large $p$ and small $n$ scenario.
Recently, various regularization methods have been proposed in the literature for estimating high-dimensional variance-covariance matrices. Some examples include tapering proposed by Furrer and Bengtsson (2007), banding by Bickel and Levina (2008b), thresholding by Bickel and Levina (2008a) and El Karoui (2008), and generalized thresholding by Rothman, Levina, and Zhu (2009). We note that covariance-thresholding operators can satisfy all three properties outlined in the previous paragraph; in particular, they can generate sparse covariance estimates to accommodate the covariance sparsity assumption. In this paper, we propose to apply covariance-thresholding on the sample covariance matrix $\hat\Sigma$ in (1.2) to stabilize and improve the performance of the lasso. We call this procedure the covariance-thresholded lasso. We establish theoretical results that relate the sparsity of the covariance matrix with variable selection and compare them to those of the lasso. Simulation and real-data examples are reported. Our results suggest that the covariance-thresholded lasso can improve upon the lasso, adaptive lasso, and elastic net, especially when $\Sigma$ is sparse, $n$ is small, and $p$ is large. Even when the underlying covariance is non-sparse, the covariance-thresholded lasso is still useful in providing robust variable selection in high dimensions.



Witten and Tibshirani (2009) have recently proposed the scout procedure, which applies regularization to the inverse covariance or precision matrix. We note that this is quite different from the covariance-thresholded lasso, which regularizes the sample covariance matrix $\hat\Sigma$ directly. Furthermore, the scout penalizes using the matrix norm $\|\Theta_{xx}\|_p^p = \sum_{i,j} |\theta_{ij}|^p$, where $\Theta_{xx}$ is an estimate of $\Sigma^{-1}$, whereas the covariance-thresholded lasso regularizes individual covariances $\hat\sigma_{ij}$ directly. In our results, we will show that the scout is potentially very similar to the elastic net and that the covariance-thresholded lasso can often outperform the scout in terms of variable selection for $p > n$.

The rest of the paper is organized as follows. In Section 2, we present 
covariance-thresholded lasso in detail and a modified LARS algorithm for our 
method. We discuss a generalized class of covariance-thresholding operators and 
explain how covariance-thresholding can stabilize the LARS algorithm for the 
lasso. In Section 3, we establish theoretical results on variable selection for 
the covariance-thresholded lasso. The effect of covariance sparsity on variable 
selection is especially highlighted. In Section 4, we provide simulation results 
of covariance-thresholded lasso at $p > n$, and, in Section 5, we compare the performance of covariance-thresholded lasso with that of the lasso, adaptive lasso, and elastic net using three real data sets. Section 6 concludes with further discussions and implications.

2. The Covariance-Thresholded Lasso 

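Suppose that the response $y$ is centered and each column of the data matrix $X$ is standardized, as in the lasso (1.2). We define the covariance-thresholded lasso estimate as

$$\hat\beta^{CT\text{-}Lasso}(\nu, \lambda_n) = \arg\min_{\beta} \left\{ \beta^T \hat\Sigma_\nu \beta - 2\beta^T \left( \frac{1}{n} X^T y \right) + 2\lambda_n \|\beta\|_1 \right\}, \qquad (2.3)$$

where $\hat\Sigma_\nu = [\hat\sigma^\nu_{ij}]$, $\hat\sigma^\nu_{ij} = s_\nu(\hat\sigma_{ij})$, $\hat\sigma_{ij} = \sum_{k=1}^{n} X_{ki}X_{kj}/n$, and $s_\nu(\cdot)$ is a pre-defined covariance-thresholding operator with $0 \le \nu \le 1$. If the identity function is used as the covariance-thresholding operator, that is, $s_\nu(x) = x$ for any $x$, then $\hat\Sigma_\nu = \hat\Sigma$ and (2.3) reduces to the lasso (1.2).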

2.1. Sparse Covariance-thresholding Operators 

We consider a generalized class of covariance-thresholding operators $s_\nu(\cdot)$ introduced in Rothman, Levina, and Zhu (2009). These operators satisfy the following properties,

$$s_\nu(\hat\sigma_{ij}) = 0 \ \text{for} \ |\hat\sigma_{ij}| \le \nu, \quad |s_\nu(\hat\sigma_{ij})| \le |\hat\sigma_{ij}|, \quad |s_\nu(\hat\sigma_{ij}) - \hat\sigma_{ij}| \le \nu. \qquad (2.4)$$

The first property enforces sparsity for covariance estimation; the second allows shrinkage of covariances; and the third limits the amount of shrinkage. These operators satisfy the desired properties outlined in the Introduction for sparse covariance-regularizing operators and represent a wide spectrum of thresholding procedures that can induce sparsity and stabilize the sample covariance matrix. In this paper, we will consider the following covariance-thresholding operators for $\hat\sigma_{ij}$ when $i \neq j$.

1. Hard thresholding: $s_\nu^{Hard}(\hat\sigma_{ij}) = \hat\sigma_{ij}\, 1(|\hat\sigma_{ij}| > \nu)$. (2.5)

2. Soft thresholding: $s_\nu^{Soft}(\hat\sigma_{ij}) = \mathrm{sgn}(\hat\sigma_{ij})(|\hat\sigma_{ij}| - \nu)_+$. (2.6)

3. Adaptive thresholding: For $\gamma \ge 0$,

$s_\nu^{Adapt}(\hat\sigma_{ij}) = \mathrm{sgn}(\hat\sigma_{ij})(|\hat\sigma_{ij}| - \nu^{\gamma+1}|\hat\sigma_{ij}|^{-\gamma})_+$. (2.7)



Figure 2: Hard, soft, and adaptive ($\gamma = 2$) sparse covariance-thresholding operators with $\nu$ varying over 0 (solid), 0.25 (dashed), 0.5 (dot-dashed), and 0.75 (dotted); and elastic net covariance-regularizing operator with $\lambda_2$ varying over 0 (solid), 0.5 (dashed), 1.5 (dot-dashed), and 4 (dotted).



The above operators are used in Rothman, Levina, and Zhu (2009) for estimating variance-covariance matrices, and it is easy to check that they satisfy the properties in (2.4).

In Figure 2, we depict the sparse covariance-thresholding operators (2.5)-(2.7) for varying $\nu$. Hard thresholding presents a discontinuous thresholding of covariances, whereas soft thresholding offers continuous shrinkage. Adaptive thresholding presents less regularization on covariances with large magnitudes than soft thresholding.
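The three operators can be written down in a few lines. The following is a minimal sketch in Python with NumPy, given only as an illustration; the function names are hypothetical, and `threshold_covariance` reflects the convention stated above that thresholding is applied to off-diagonal entries only.

```python
import numpy as np

def hard_threshold(sigma, nu):
    """Hard thresholding (2.5): keep entries with magnitude above nu, zero the rest."""
    sigma = np.asarray(sigma, dtype=float)
    return sigma * (np.abs(sigma) > nu)

def soft_threshold(sigma, nu):
    """Soft thresholding (2.6): shrink every entry toward zero by nu."""
    sigma = np.asarray(sigma, dtype=float)
    return np.sign(sigma) * np.maximum(np.abs(sigma) - nu, 0.0)

def adaptive_threshold(sigma, nu, gamma=2.0):
    """Adaptive thresholding (2.7): the shrinkage nu^(gamma+1) / |sigma|^gamma decays for large entries."""
    sigma = np.asarray(sigma, dtype=float)
    mag = np.abs(sigma)
    out = np.zeros_like(sigma)
    nz = mag > 0  # exact zeros stay zero and avoid a division by zero
    out[nz] = np.sign(sigma[nz]) * np.maximum(mag[nz] - nu ** (gamma + 1) / mag[nz] ** gamma, 0.0)
    return out

def threshold_covariance(Sigma_hat, nu, operator=soft_threshold):
    """Apply a thresholding operator to the off-diagonal entries, keeping the diagonal."""
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)
    out = operator(Sigma_hat, nu)
    np.fill_diagonal(out, np.diag(Sigma_hat))
    return out

sigma_grid = np.array([-0.6, -0.2, 0.0, 0.2, 0.6])
nu_demo = 0.25
```

With $\gamma = 0$ the adaptive operator coincides with soft thresholding, which gives a quick sanity check of an implementation.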

Figure 2 further includes the elastic net covariance-regularizing operator, $r_{\lambda_2}^{EN}(\hat\sigma_{ij}) = \hat\sigma_{ij}/(1 + \lambda_2)$ for $i \neq j$, which follows from the off-diagonal entries of $(\hat\Sigma + \lambda_2 I)/(1 + \lambda_2)$. This operator is non-sparse and does not satisfy the first property in (2.4). In particular, we see that the elastic net penalizes covariances with large magnitudes more severely than those with small magnitudes. In some situations, this has the benefit of alleviating multicollinearity, as it shrinks covariances of highly correlated variables. However, under high dimensionality, and when much of the random perturbation of the covariance matrix arises from small but numerous covariances, the elastic net, in attempting to control these variabilities, may inadvertently penalize covariances with large magnitudes severely. This may introduce large bias in estimation and compromise the performance of the elastic net under some scenarios.



2.2. Computations 

The lasso solution paths are shown to be piecewise linear in Efron, Hastie, Johnstone, and Tibshirani (2004) and Rosset and Zhu (2007). This property allows Efron, Hastie, Johnstone, and Tibshirani (2004) to propose the efficient LARS algorithm for the lasso. Likewise, in this section, we propose a piecewise-linear algorithm for the covariance-thresholded lasso.

We note that the loss function $\beta^T \hat\Sigma_\nu \beta - 2\beta^T X^T y / n$ in (2.3) can sometimes be non-convex, since $\hat\Sigma_\nu$ may possess negative eigenvalues for some $\nu$. This usually may occur for intermediary values of $\nu$, as $\hat\Sigma_\nu$ is at least semi-positive definite for $\nu$ close to 0 or 1. Furthermore, we note that the penalty $2\lambda_n\|\beta\|_1$ is a convex function and dominates in (2.3) for $\lambda_n$ large. Intuitively, this means that the optimization problem for the covariance-thresholded lasso is almost convex for $\beta$ sparse. This is stated conservatively in the following theorem by using the second-order condition from nonlinear programming (McCormick 1976).
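The loss of positive semi-definiteness at intermediary thresholds is easy to see numerically. The following minimal sketch in Python with NumPy uses a hypothetical $3 \times 3$ correlation matrix chosen by us for illustration (it is not taken from the paper's data) and hard thresholding of the off-diagonal entries.

```python
import numpy as np

def hard_threshold_offdiag(Sigma_hat, nu):
    """Hard-threshold the off-diagonal entries at nu, keeping the diagonal."""
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)
    out = Sigma_hat * (np.abs(Sigma_hat) > nu)
    np.fill_diagonal(out, np.diag(Sigma_hat))
    return out

def min_eigenvalue(A):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh returns ascending order)."""
    return float(np.linalg.eigvalsh(A)[0])

# Hypothetical positive definite correlation matrix.
Sigma_demo = np.array([[1.00, 0.75, 0.40],
                       [0.75, 1.00, 0.75],
                       [0.40, 0.75, 1.00]])

# nu = 0 leaves the matrix unchanged, nu = 0.5 removes only the 0.40 entries,
# and nu = 0.8 removes every off-diagonal entry.
min_eigs = {nu: min_eigenvalue(hard_threshold_offdiag(Sigma_demo, nu))
            for nu in (0.0, 0.5, 0.8)}
```

Here the smallest eigenvalue is positive at $\nu = 0$, becomes negative at the intermediary value $\nu = 0.5$ (it equals $1 - 0.75\sqrt{2}$), and returns to 1 at $\nu = 0.8$, where the thresholded matrix is the identity.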



Theorem 2.1 Let $\nu$ be fixed. If $\hat\Sigma_\nu$ is semi-positive definite, the covariance-thresholded lasso solutions $\hat\beta^{CT\text{-}Lasso}(\nu, \lambda_n)$ for (2.3) are piecewise linear with respect to $\lambda_n$. If $\hat\Sigma_\nu$ possesses negative eigenvalues, a set of covariance-thresholded lasso solutions, which may be local minima for (2.3) under strict complementarity, is piecewise linear with respect to $\lambda_n$ for $\lambda_n > \lambda^*$, where $\lambda^* = \min\{\lambda > 0 :$ the sub-matrix $(\hat\Sigma_\nu)_A$ remains positive definite for $A = \{j \mid \hat\beta_j^{CT\text{-}Lasso}(\nu, \lambda) \neq 0\}\}$.

The proof for Theorem 2.1 is outlined in Appendix 7.6. Strict complementarity, described in Appendix 7.6, is a technical condition that allows the second-order condition to be more easily interpreted and usually holds with high probability. We note that, when $\hat\Sigma_\nu$ has negative eigenvalues, the solution $\hat\beta^{CT\text{-}Lasso}(\nu, \lambda_n)$ is global if $|X_j^T y / n| < \lambda_n$ for all $j \notin A = \{j : \hat\beta_j^{CT\text{-}Lasso}(\nu, \lambda_n) \neq 0\}$ and $(\hat\Sigma_\nu)_A$ is positive definite. Theorem 2.1 suggests that piecewise linearity of the covariance-thresholded lasso solution path sometimes may not hold for some $\nu$ when $\lambda_n$ is small, even if a solution may well exist. This restricts the sets of tuning parameters $(\nu, \lambda_n)$ for which we can compute the solutions of the covariance-thresholded lasso efficiently using a LARS-type algorithm. We note that the elastic net does not suffer from a potentially non-convex optimization. However, as we will demonstrate in Figure 3 of Section 4, the covariance-thresholded lasso with restricted sets of $(\nu, \lambda_n)$ is, nevertheless, rich enough to dominate the elastic net in many situations.

Theorem 2.1 establishes that a set of covariance-thresholded lasso solutions is piecewise linear. This further provides us with an efficient modified LARS algorithm for computing the covariance-thresholded lasso. Let

$$(\hat c_\nu)_j = \frac{1}{n} X_j^T y - (\hat\Sigma_\nu)_j^T \hat\beta \qquad (2.8)$$

be estimates for the covariate-residual correlations $c_j$. Further, we denote the minimum eigenvalue of a matrix $A$ as $\Lambda_{\min}(A)$. The covariance-thresholded lasso can be computed with the following algorithm.

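ALGORITHM: Covariance-thresholded LARS

1. Initialize $\hat\Sigma_\nu$ such that $\hat\sigma^\nu_{ij} = s_\nu(\hat\sigma_{ij})$, $\hat\beta = 0$, and $\hat c_\nu = \frac{1}{n} X^T y$. Let $A = \arg\max_j |(\hat c_\nu)_j|$, $\hat C = \max |(\hat c_\nu)_A|$, $\gamma_A = \mathrm{sgn}((\hat c_\nu)_A)$, $\gamma_{A^c} = 0$, and $a = (\hat\Sigma_\nu)^T \gamma$.

2. Let $\delta_1 = \min^+_{j \in A} \{-\hat\beta_j / \gamma_j\}$ and $\delta_2 = \min^+_{j \in A^c} \{\ldots\}$ for any $i \in A$, where $\min^+$ is taken only over positive elements.

3. Let $\delta = \min(\delta_1, \delta_2)$, $\hat\beta \leftarrow \hat\beta + \delta\gamma$, $\hat c_\nu \leftarrow \hat c_\nu - \delta a$, and $\hat C = \max_{j \in A} |(\hat c_\nu)_j|$.

4. If $\delta = \delta_1$, remove the variable hitting 0 at $\delta$ from $A$. If $\delta = \delta_2$, add the variable first attaining equality at $\delta$ to $A$.

5. Compute the new direction, $\gamma_A = ((\hat\Sigma_\nu)_A)^{-1}\mathrm{sgn}(\hat\beta_A)$ and $\gamma_{A^c} = 0$, and let $a = (\hat\Sigma_\nu)^T \gamma$.

6. Repeat steps 2-5 until $\min_{j \in A} \ldots < 0$ or $\Lambda_{\min}((\hat\Sigma_\nu)_A) < 0$.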

The covariate-residual correlations $c_j$ are the most crucial quantities for computing the solution paths. They determine the variable to be included at each step and relate directly to the tuning parameter $\lambda_n$. In the original LARS for the lasso, $c_j$ is estimated as $X_j^T y / n - (\hat\Sigma)_j^T \hat\beta$, which uses the sample covariance matrix $\hat\Sigma$ without thresholding. In covariance-thresholded LARS, $(\hat c_\nu)_j$ is defined using the covariance-thresholded estimate $(\hat\Sigma_\nu)_j^T = (\hat\sigma^\nu_{1j}, \hat\sigma^\nu_{2j}, \ldots, \hat\sigma^\nu_{pj})$, which may contain many zeros. We note that, in (2.8), zero-valued covariances $\hat\sigma^\nu_{ij}$ have the effect of essentially removing the associated coefficients from $\hat\beta$, providing parsimonious estimates for $c_j$. This allows covariance-thresholded LARS to estimate $c_j$ in a more stable way than the LARS. It is clear that covariance-thresholded LARS presents an advantage if the population covariance is sparse. On the other hand, if the covariance is non-sparse, covariance-thresholded LARS can still outperform the LARS when the sample size is small or the data are noisy. This is because parsimonious estimates $(\hat c_\nu)_j$ of $c_j$ can be more robust against random variability of the data.

Moreover, consider computing the direction of the solution paths in Step 5, which is used for updating $(\hat c_\nu)_j$. LARS for the lasso updates new directions with $(\hat\Sigma_A)^{-1}\mathrm{sgn}(\hat\beta_A)$, whereas covariance-thresholded LARS uses $\gamma_A = ((\hat\Sigma_\nu)_A)^{-1}\mathrm{sgn}(\hat\beta_A)$. Covariance-thresholded LARS can thus exploit potential covariance sparsity to improve and stabilize estimates of the directions of the solution paths. In addition, the LARS for the lasso can stop early, before all true variables in $S$ can be considered, if $\hat\Sigma_A$ is rank deficient at an early stage when the sample size is limited. Covariance-thresholding can mitigate this problem by proceeding further with properly chosen values of $\nu$. For example, when $\nu \to 1$, $\hat\Sigma_\nu$ converges towards the identity matrix $I$, which is full-ranked.
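For a single pair $(\nu, \lambda_n)$, the criterion (2.3) can also be minimized by cyclic coordinate descent. The sketch below, in Python with NumPy, is a standard technique that we substitute here for illustration; it is not the covariance-thresholded LARS algorithm described in this section and does not trace the solution path. The function name `ct_lasso_cd` and its arguments are hypothetical. We assume positive diagonal entries of $\hat\Sigma_\nu$ (they equal 1 for standardized columns), and convergence to a global minimum should only be expected when $\hat\Sigma_\nu$ is positive semi-definite.

```python
import numpy as np

def ct_lasso_cd(Sigma_nu, c, lam, n_iter=500, tol=1e-10):
    """Cyclic coordinate descent for b' Sigma_nu b - 2 b' c + 2 lam ||b||_1, with c = X'y / n."""
    Sigma_nu = np.asarray(Sigma_nu, dtype=float)
    c = np.asarray(c, dtype=float)
    beta = np.zeros(c.size)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(c.size):
            # Correlation of variable j with the partial residual that excludes beta_j.
            rho = c[j] - Sigma_nu[j] @ beta + Sigma_nu[j, j] * beta[j]
            # Soft-threshold rho at lam and rescale by the diagonal entry.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / Sigma_nu[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

# With the identity in place of Sigma_nu, the solution is univariate soft thresholding.
beta_identity = ct_lasso_cd(np.eye(3), [2.0, -0.3, 0.5], lam=0.4)
# A small correlated example.
beta_corr = ct_lasso_cd([[1.0, 0.5], [0.5, 1.0]], [1.0, 0.5], lam=0.25)
```

The first call illustrates the remark in the Introduction that replacing the covariance matrix by the identity yields the UST estimates; in the second, the correlated variable is excluded because its partial residual correlation falls below $\lambda_n$.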

3. Theoretical Results on Variable Selection 

In this section, we derive sufficient conditions for covariance-thresholded lasso 
to be consistent in selecting the true variables. We relate covariance sparsity 
with variable selection and demonstrate the pivotal role that covariance sparsity 
plays in improving variable selection under high-dimensionality. Furthermore, 
variable selection results for the lasso under the random design are derived and 
compared with those of the covariance-thresholded lasso. We show that the 
covariance-thresholded lasso, by utilizing covariance sparsity through a properly 
chosen thresholding level u, can improve upon the lasso in terms of variable 
selection. 

For simplicity, we assume that a solution for (2.3) exists and denote this solution by the covariance-thresholded lasso estimate $\hat\beta^\nu = \hat\beta^{CT\text{-}Lasso}(\nu, \lambda_n)$. Further, we let $supp(\beta) = \{j : \beta_j \neq 0\}$ represent the collection of indices of nonzero coefficients. We say that the covariance-thresholded lasso estimate $\hat\beta^\nu$ is variable selection consistent if $P(supp(\hat\beta^\nu) = supp(\beta^*)) \to 1$, as $n \to \infty$. In addition, we say that $\hat\beta^\nu$ is sign consistent if $P(\mathrm{sgn}(\hat\beta^\nu) = \mathrm{sgn}(\beta^*)) \to 1$, as $n \to \infty$, where $\mathrm{sgn}(t) = -1, 0, 1$ when $t < 0$, $t = 0$, and $t > 0$, respectively (Zhao and Yu 2006). Obviously, sign consistency is a stronger property and implies variable selection consistency.

We introduce two quantities to characterize the sparsity of $\Sigma$ that plays a pivotal role in the performance of the covariance-thresholded lasso. Recall that $S$ and $C$ are collections of the true and irrelevant variables, respectively. Define

$$d^*_{SS} = \max_{i \in S} \sum_{j \in S} 1(\sigma_{ij} \neq 0) \quad \text{and} \quad d^*_{CS} = \max_{i \in C} \sum_{j \in S} 1(\sigma_{ij} \neq 0). \qquad (3.9)$$

$d^*_{SS}$ ranges between 1 and $s$. When $d^*_{SS} = 1$, all pairs of the true variables are orthogonal. When $d^*_{SS} = s$, there is at least one true variable correlated with all other true variables. Similarly, $d^*_{CS}$ is between 0 and $s$. When $d^*_{CS} = 0$, the true and irrelevant variables are orthogonal to each other, and, when $d^*_{CS} = s$, some irrelevant variables are correlated with all the true variables. The values of $d^*_{SS}$ and $d^*_{CS}$ represent the sparsity of covariance sub-matrices for the true variables and between the irrelevant and true variables, respectively. We have not specified the sparsity of the sub-matrix for the irrelevant variables themselves. It will be clear later that it is the structure of $\Sigma_{SS}$ and $\Sigma_{CS}$, instead of $\Sigma_{CC}$, that plays the pivotal role in variable selection. We note that $d^*_{SS}$ and $d^*_{CS}$ are related to another notion of sparsity used in Bickel and Levina (2008a) to define the class of matrices $\{\Sigma : \sigma_{ii} \le M, \ \sum_{j=1}^{p} 1(\sigma_{ij} \neq 0) \le c_0(p) \ \text{for} \ 1 \le i \le p\}$, for $M$ given and $c_0(p)$ a constant depending on $p$. We use the specific quantities $d^*_{SS}$ and $d^*_{CS}$ in (3.9) in order to provide easier presentation of our results for variable selection. Our results in this section can be applied to more general characterizations of sparsity, such as in Bickel and Levina (2008a).
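Both sparsity quantities in (3.9) are maximal row counts of nonzero entries in sub-blocks of the covariance matrix. The following minimal sketch in Python with NumPy computes them for a given matrix and support; the function name `covariance_sparsity` is hypothetical, and the example matrix is ours, chosen only for illustration.

```python
import numpy as np

def covariance_sparsity(Sigma, support):
    """Return (d_SS, d_CS): maximal counts of nonzero covariances with the true variables."""
    Sigma = np.asarray(Sigma, dtype=float)
    S = np.asarray(support, dtype=int)
    C = np.setdiff1d(np.arange(Sigma.shape[0]), S)
    nonzero = Sigma != 0
    # Row-wise counts within the true variables (includes the diagonal, hence >= 1).
    d_ss = int(nonzero[np.ix_(S, S)].sum(axis=1).max())
    # Row-wise counts between each irrelevant variable and the true variables.
    d_cs = int(nonzero[np.ix_(C, S)].sum(axis=1).max()) if C.size else 0
    return d_ss, d_cs

# Hypothetical 4 x 4 covariance: variables 0 and 1 are true and correlated,
# variable 2 is correlated with variable 0, variable 3 is independent.
Sigma_example = np.eye(4)
Sigma_example[0, 1] = Sigma_example[1, 0] = 0.5
Sigma_example[0, 2] = Sigma_example[2, 0] = 0.3
d_example = covariance_sparsity(Sigma_example, support=[0, 1])
```

For the identity matrix the function returns $(1, 0)$, the sparsest case described above; for the example matrix it returns $(2, 1)$.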



In this paper, we employ two different types of matrix norms. For an arbitrary matrix $A = [A_{ij}]$, the infinity norm is defined as $\|A\|_\infty = \max_i \sum_j |A_{ij}|$ and the spectral norm is defined as $\|A\| = \max_{\|x\|=1} \|Ax\| = \Lambda_{\max}(A)$. We use $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ to represent, respectively, the largest and smallest eigenvalues of $A$.
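Both norms are available through NumPy. In the minimal sketch below, the helper names are hypothetical; note that `numpy.linalg.norm(A, 2)` returns the largest singular value, which coincides with the largest eigenvalue when $A$ is symmetric positive semi-definite, as is the case for the covariance matrices considered here.

```python
import numpy as np

def infinity_norm(A):
    """Maximum absolute row sum, max_i sum_j |A_ij|."""
    return float(np.abs(np.asarray(A, dtype=float)).sum(axis=1).max())

def spectral_norm(A):
    """Largest singular value of A."""
    return float(np.linalg.norm(np.asarray(A, dtype=float), 2))

# Symmetric positive definite example with eigenvalues 1 and 3.
A_example = np.array([[2.0, -1.0],
                      [-1.0, 2.0]])
norms_example = (infinity_norm(A_example), spectral_norm(A_example))
```

For this example both norms equal 3, and the spectral norm matches the largest eigenvalue returned by `numpy.linalg.eigvalsh`.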



3.1. Sign Consistency of Covariance-thresholded Lasso 

We develop sign consistency results for covariance-thresholded lasso. Proofs 
for the results are presented in the Appendix. 

We first provide conditions for the covariance-thresholded lasso estimate $\hat\beta^\nu$ to have the same signs as the true coefficients $\beta^*$ under the fixed design assumption. Let $\bar\rho = \max_{j \in S} |\beta_j^*|$ and $\underline\rho = \min_{j \in S} |\beta_j^*|$.

Lemma 3.1 Suppose that the data matrix $X$ is fixed and $\nu$ is given. Then, $\mathrm{sgn}(\hat\beta^\nu) = \mathrm{sgn}(\beta^*)$ if

$$\Lambda_{\min}(\hat\Sigma^\nu_{SS}) > 0, \qquad (3.10)$$

$$\ldots \left( \left\| \tfrac{1}{n} X_S^T \epsilon \right\|_\infty + \sup \ldots + \lambda_n \right) + \sup \ldots + \ldots \le \lambda_n, \qquad (3.11)$$

and

$$\ldots \left( \left\| \tfrac{1}{n} X_S^T \epsilon \right\|_\infty + \sup \ldots + \lambda_n \right) < \underline\rho. \qquad (3.12)$$

The above (3.10), (3.11), and (3.12) are derived from the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem presented in (2.3) when the solution, which may be a local minimum, exists. Following the arguments in Zhao and Yu (2006) and Wainwright (2006), these conditions are almost necessary for $\hat\beta^\nu$ to have the correct signs. The condition (3.10) is needed for (3.11) and (3.12) to be valid. That is, the conditions (3.11) and (3.12) are ill-defined if $\hat\Sigma^\nu_{SS}$ is singular.

Assume the random design setting so that $X$ is drawn from some distribution with population covariance $\Sigma$. We demonstrate how the sparsity of $\Sigma_{SS}$ and the procedure of covariance-thresholding work together to ensure that the condition (3.10) is satisfied. We impose the following moment conditions on the random predictors $X_1, \ldots, X_p$:

$$E X_j = 0, \quad E X_j^{2d} \le d!\, M^d, \quad 1 \le j \le p, \qquad (3.13)$$

for some constant $M > 0$ and $d \in \mathbb{N}$. Assume that

$$\Lambda_{\min}(\Sigma_{SS}) > 0, \qquad (3.14)$$

and $d^*_{SS}$, $s$, and $n$ satisfy

$$d^*_{SS} \sqrt{\log s} / \sqrt{n} \to 0. \qquad (3.15)$$

We have the following lemma.

Lemma 3.2 Let $\nu = C\sqrt{\log s}/\sqrt{n}$ for some constant $C > 0$. Under the conditions (3.13), (3.14), and (3.15),

$$P\left(\Lambda_{\min}(\hat\Sigma^\nu_{SS}) > 0\right) \to 1. \qquad (3.16)$$

The rate of convergence for (3.16) depends on the rate of convergence for (3.15). It is clear that the smaller $d^*_{SS}$ (or the sparser $\Sigma_{SS}$) is, the faster (3.15), as well as (3.16), converges. Equivalently, for sample size $n$ fixed, the smaller $d^*_{SS}$ is, the larger the probability that $\Lambda_{\min}(\hat\Sigma^\nu_{SS}) > 0$. In other words, covariance-thresholding can help to fix potential rank deficiency of $\hat\Sigma_{SS}$ when $\Sigma_{SS}$ is sparse. In the special case when $\Sigma_{SS} = I_s$ and $d^*_{SS} = 1$, it can be shown that $\hat\Sigma^\nu_{SS}$ is asymptotically positive definite provided that $s = o(\exp(n))$.

Next, we investigate the remaining two conditions (3.11) and (3.12) in Lemma 3.1. For (3.11) and (3.12) to hold with probability going to 1, additional assumptions including the irrepresentable condition need to be imposed. Since the data matrix $X$ is assumed to be random, the original irrepresentable condition needs to be stated in terms of the population covariance matrix $\Sigma$ as follows,

$$\left\| \Sigma_{CS} (\Sigma_{SS})^{-1} \right\|_\infty \le 1 - \epsilon, \qquad (3.17)$$

for some $0 < \epsilon < 1$. We note that the original irrepresentable condition in Zhao and Yu (2006) also involves the signs of $\beta_S^*$. To simplify presentation, we use the stronger condition (3.17) instead. Obviously, (3.17) does not directly imply that $\|\hat\Sigma^\nu_{CS}(\hat\Sigma^\nu_{SS})^{-1}\|_\infty \le 1 - \epsilon$. The next lemma establishes the asymptotic behaviors of $\|(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$ and $\|\hat\Sigma^\nu_{CS}(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$. Let $D = \|(\Sigma_{SS})^{-1}\|_\infty$. Assume

$$D\, d^*_{SS} \sqrt{\log(p - s)} / \sqrt{n} \to 0, \qquad (3.18)$$

$$D^2 d^*_{CS}\, d^*_{SS} \sqrt{\log(p - s)} / \sqrt{n} \to 0. \qquad (3.19)$$

Lemma 3.3 Suppose that $p - s > s$ and $\nu = C\sqrt{\log(s(p - s))}/\sqrt{n}$ for some constant $C > 0$. Under conditions (3.13), (3.14), (3.17), (3.18), and (3.19),

$$P\left( \left\| (\hat\Sigma^\nu_{SS})^{-1} \right\|_\infty \le D \right) \to 1, \qquad (3.20)$$

$$P\left( \left\| \hat\Sigma^\nu_{CS} (\hat\Sigma^\nu_{SS})^{-1} \right\|_\infty \le 1 - \ldots \right) \to 1. \qquad (3.21)$$

The above lemma indicates that, with a properly chosen thresholding parameter $\nu$ and sample size depending on the covariance-sparsity quantities $d^*_{SS}$ and $d^*_{CS}$, both $\|(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$ and $\|\hat\Sigma^\nu_{CS}(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$ behave as their population counterparts $\|(\Sigma_{SS})^{-1}\|_\infty$ and $\|\Sigma_{CS}(\Sigma_{SS})^{-1}\|_\infty$, asymptotically. Again, the influence of the sparsity of $\Sigma$ on $\|(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$ and $\|\hat\Sigma^\nu_{CS}(\hat\Sigma^\nu_{SS})^{-1}\|_\infty$ is shown through $d^*_{CS}$ and $d^*_{SS}$. Asymptotically, the smaller $d^*_{CS}$ and $d^*_{SS}$ are, the faster (3.20) and (3.21) converge. Or equivalently, for sample size $n$ fixed, the smaller $d^*_{CS}$ and $d^*_{SS}$ are, the larger the probabilities in (3.20) and (3.21) are. In the special case when $d^*_{CS} = 0$, or $\Sigma_{CS}$ is a zero matrix, condition (3.19) is always satisfied.

Finally, we are ready to state the sign consistency result for $\hat\beta^\nu$. With the help of Lemmas 3.1-3.3 stated above, the only issue left is to show the existence of a proper $\lambda_n$ such that (3.11) and (3.12) hold with probability going to 1. One more condition is needed. We assume that

$$D \bar\rho\, s \sqrt{\log(p - s)} / (\underline\rho \sqrt{n}) \to 0. \qquad (3.22)$$

Theorem 3.2 Suppose that $p - s > s$, $\nu = C\sqrt{\log(s(p - s))}/\sqrt{n}$ for some constant $C > 0$, and $\lambda_n$ is chosen such that $\lambda_n \to 0$,

$$\sqrt{n}\,\lambda_n / \left(s \bar\rho \sqrt{\log(p - s)}\right) \to \infty, \quad \text{and} \quad D\lambda_n / \underline\rho \to 0. \qquad (3.23)$$

Then, under conditions (3.13), (3.14), (3.17), (3.18), (3.19), and (3.22),

$$P\left( \mathrm{sgn}(\hat\beta^\nu) = \mathrm{sgn}(\beta^*) \right) \to 1. \qquad (3.24)$$

We note that the assumption $p - s > s$ is natural for high-dimensional sparse models, which usually have a large number of irrelevant variables. This assumption affects the conditions (3.19) and (3.22) as well as the choices of $\nu$ and $\lambda_n$. When $p - s < s$, that is, when a non-sparse linear model is assumed, the conditions for $\hat\beta^\nu$ to be sign consistent need to be modified by choosing $\nu$ as $\nu = C\sqrt{\log s}/\sqrt{n}$ and replacing $\sqrt{\log(p - s)}$ by $\sqrt{\log s}$ in conditions (3.19), (3.22), and (3.23).

It is possible to establish the convergence rate for the probability in ()3.24p 
more explicitly. For simplicity of presentation, we provide such a result under a 
special case in the following theorem. 

Theorem 3.3 Suppose that conditions iS.lS]) . |3. and Jg. i?] ) hold and D, 
p, and p are constants. Let A„ = n~''^, u = n~^'^ , d*g = max{d*gg,dQg} = n^^ , 
s = n^^ , and log p = o{n^~''^'^+n{n~^'^—n~^^)'^), where c, ci, C2, and c-^ are positive 
constants such that c < 1/2, ci < 1/2, C2 < 1/4, C2 < C3, and C3 + c < ci. Then, 

P (^sgni^") = s5n(/3*)) > 1-0 {exp{-ain^-^^))-0 {exp{-a2n{n-^^^ - n-^^)^)) 

(3.25) 

where $a_1$ and $a_2$ are some positive constants depending on $\epsilon$, $\bar{D}$, $M$ and $\rho$.

The proof of Theorem 3.3, which we omit, is similar to that of Theorem 3.2.
We note that the conditions on dimension parameters in Theorem 3.2 are now
expressed in the convergence rate of (3.25). It is clear that the smaller $d^*$ is, the
larger the probability is in (3.25).

3.2. Comparison with the Lasso 

We compare sign consistency results of covariance-thresholded lasso with
those of the lasso. By choosing $\nu = 0$, the covariance-thresholded lasso
estimate $\hat{\beta}^{\nu}$ can be reduced to the lasso estimate $\hat{\beta}^{0}$. Results on sign
consistency of the lasso have been established in the literature (Zhao and Yu (2006),
Meinshausen and Bühlmann (2006), Wainwright (2006)). To facilitate comparison,
we restate sign consistency results for $\hat{\beta}^{0}$ in the same way that we presented
results for $\hat{\beta}^{\nu}$ in Section 3.1. The proofs for sign consistency of $\hat{\beta}^{0}$, which we
omit, are similar to those for $\hat{\beta}^{\nu}$.

First, assuming fixed design, we have the sufficient and almost necessary
conditions for $\mathrm{sgn}(\hat{\beta}^{0}) = \mathrm{sgn}(\beta^*)$ as in (3.11)-(3.12) with $\nu = 0$.

Next, we assume the random design. Analogous to Lemma 3.2, the sufficient
conditions for $P(\Lambda_{\min}(\hat{\Sigma}_{SS}) > 0) \to 1$ are (3.13), (3.14), and

$$s\sqrt{\log s}/\sqrt{n} \to 0. \qquad (3.26)$$

Compared to (3.15), (3.26) is clearly more demanding since $d^*_{SS}$ is always less than
or equal to $s$. Note that a necessary condition for $\hat{\Sigma}_{SS}$ to be non-singular is $s < n$,
which is not required for $\hat{\Sigma}^{\nu}_{SS}$. Thus, the non-singularity of the sample covariance
sub-matrix $\hat{\Sigma}_{SS}$ is harder to attain than that of $\hat{\Sigma}^{\nu}_{SS}$. In other words,
covariance-thresholded lasso may increase the rank of $\hat{\Sigma}_{SS}$ by thresholding. When $\Sigma$ is
sparse, this can be beneficial for variable selection under the large $p$ and small $n$
scenario.
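
The rank argument above can be illustrated numerically. The following is a minimal sketch rather than the authors' implementation: it assumes hard thresholding applied to off-diagonal entries only, and the helper name `hard_threshold`, the seed, and the choice of threshold are illustrative assumptions. With $n < s$, the sample covariance matrix is rank deficient, while a sufficiently thresholded version is not.

```python
import numpy as np

def hard_threshold(cov, nu):
    """Zero out entries with |value| <= nu, then restore the diagonal.

    Assumption: only off-diagonal entries are thresholded.
    """
    out = np.where(np.abs(cov) > nu, cov, 0.0)
    np.fill_diagonal(out, np.diag(cov))
    return out

rng = np.random.default_rng(0)
n, s = 5, 10                       # fewer observations than variables
X = rng.standard_normal((n, s))    # independent predictors, so Sigma is diagonal
S = np.cov(X, rowvar=False)        # s x s sample covariance, rank at most n - 1

off_diag = S - np.diag(np.diag(S))
nu = np.abs(off_diag).max()        # large enough to remove every off-diagonal entry
S_nu = hard_threshold(S, nu)

rank_before = np.linalg.matrix_rank(S)
rank_after = np.linalg.matrix_rank(S_nu)
print(rank_before, rank_after)
```

The threshold here is deliberately extreme so that the outcome does not depend on the random draw; in practice $\nu$ would be tuned, and a smaller value may or may not restore full rank.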

To ensure that $P(\|(\hat{\Sigma}_{SS})^{-1}\|_{\infty} < \bar{D}) \to 1$ and
$P(\|\hat{\Sigma}_{CS}(\hat{\Sigma}_{SS})^{-1}\|_{\infty} < 1 - \epsilon/2) \to 1$, as in Lemma 3.3 with $\nu = 0$,
we further assume the conditions (3.17) and

$$\bar{D}^2 s\sqrt{\log(p-s)}/\sqrt{n} \to 0. \qquad (3.27)$$

We note that (3.27) is the main condition that guarantees that $\hat{\Sigma}$ satisfies the
irrepresentable condition with probability going to 1. Compared with (3.19),
(3.27) is clearly more demanding since $s$ is larger than both $d^*_{CS}$ and $d^*_{SS}$. This
implies that it is harder for $\hat{\Sigma}$ than for $\hat{\Sigma}^{\nu}$ to satisfy the irrepresentable condition.
In other words, covariance-thresholded lasso is more likely to be variable selection
consistent than the lasso when data are randomly generated from a distribution
that satisfies (3.17).

Finally, with the additional condition,

$$\bar{D}\sqrt{s}/(\rho\sqrt{n}) \to 0, \qquad (3.28)$$

we arrive at the sign consistency of the lasso as the following.

Corollary 3.1 Assume that the conditions (3.13), (3.14), (3.17), (3.27), and
(3.28) are satisfied. If $\lambda_n$ is chosen such that $\lambda_n \to 0$,

$$\sqrt{n}\,\lambda_n/\sqrt{\log(p-s)} \to \infty, \quad \text{and} \quad \bar{D}\lambda_n/\rho \to 0, \qquad (3.29)$$

then $P\left(\mathrm{sgn}(\hat{\beta}^{0}) = \mathrm{sgn}(\beta^*)\right) \to 1$.



Compare Corollary 3.1 with Theorem 3.2 for covariance-thresholded lasso. We
see that conditions (3.13), (3.14), (3.17) on random predictors, in particular the
covariances, are the same, but conditions on dimension parameters, such as $n$, $p$,
$s$, etc., are different. When the population covariance matrix $\Sigma$ is sparse,
condition (3.19) on dimension parameters is much weaker for covariance-thresholded
lasso than condition (3.27) for the lasso. This shows that covariance-thresholded
lasso can improve the possibility that a consistent solution exists. However,
there is a trade-off in the selection of the tuning parameter $\lambda_n$. The first condition
in (3.23) for covariance-thresholded lasso is clearly more restrictive than the
condition in (3.29) for the lasso. This results in a narrower range of valid $\lambda_n$.
We argue that, compared with the existence of a consistent solution, the range of
$\lambda_n$ is of secondary concern.

We note that a related sign consistency result under random design for the
lasso has been established in Wainwright (2006). That work assumes that the
predictors are normally distributed and utilizes the resulting distribution of the sample
covariance matrix. The conditions used in Wainwright (2006) include (3.14),
(3.17), $\Lambda_{\max}(\Sigma) < \infty$, $\bar{D} < \infty$, $\log(p-s)/(n-s) \to 0$, $\sqrt{\log s}/(\rho\sqrt{n}) \to 0$, and
$n > 2\left(\Lambda_{\max}(\Sigma)/(\epsilon^{2}\Lambda_{\min}(\Sigma_{SS})) + \nu\right) s\log(p-s) + s + 1$, for some constant $\nu > 0$.
In comparison, we assume, in this paper, that the random predictors follow the
more general moment conditions (3.13), which contain the Gaussian assumption
as a special case. Moreover, we use a new approach to establish sign consistency
that can incorporate the sparsity of the covariance matrix.

4. Simulations 

In this section, we examine the finite-sample performances of the covariance-thresholded
lasso for $p > n$ and compare them to those of the lasso, adaptive
lasso with univariate as initial estimates, UST, scout(1,1), scout(2,1), and elastic
net. Further, we propose a novel variant of cross-validation that allows improved
variable selection when $n$ is much less than $p$. We note that the scout(1,1)
procedure can be computationally expensive. Results for scout(1,1) that take
longer than 5 days on an RCAC cluster are not shown.

We compare variable selection performances using the G-measure,
$G = \sqrt{\text{sensitivity} \times \text{specificity}}$. G is defined as the geometric mean between
sensitivity, (no. of true positives)/$s$, and specificity, $1 -$ (no. of false positives)/$(p - s)$.
Sensitivity and specificity can be interpreted as the proportion of selecting the
true variables correctly and discarding the irrelevant variables correctly, respectively.
Sensitivity can also be defined as 1 minus the false negative rate and specificity
as 1 minus the false positive rate. A value close to 1 for G indicates good selection,
whereas a value close to 0 implies that few true variables or too many irrelevant
variables are selected, or both. Furthermore, we compare prediction accuracy
using the relative prediction error (RPE),
$\mathrm{RPE} = (\hat{\beta} - \beta^*)^T \Sigma (\hat{\beta} - \beta^*)/\sigma^2$, where
$\Sigma$ is the population covariance matrix. The RPE is obtained by re-scaling the
mean-squared error (ME), as in Tibshirani (1996), by $1/\sigma^2$.
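
The two performance measures defined above can be computed as in the following sketch. The function names and the toy vectors are illustrative assumptions; a variable is treated as selected when its estimated coefficient is nonzero.

```python
import numpy as np

def g_measure(beta_hat, beta_true):
    """Geometric mean of sensitivity and specificity of the selected support."""
    selected = beta_hat != 0
    truth = beta_true != 0
    s = truth.sum()                      # number of true variables
    p = beta_true.size                   # total number of predictors
    tp = np.sum(selected & truth)        # true positives
    fp = np.sum(selected & ~truth)       # false positives
    sensitivity = tp / s
    specificity = 1.0 - fp / (p - s)
    return np.sqrt(sensitivity * specificity)

def relative_prediction_error(beta_hat, beta_true, cov, sigma):
    """(beta_hat - beta*)^T Sigma (beta_hat - beta*) / sigma^2."""
    diff = beta_hat - beta_true
    return diff @ cov @ diff / sigma**2

# Toy example: p = 10, two true variables, one found, two false positives.
beta_true = np.array([3.0, 3.0, 0, 0, 0, 0, 0, 0, 0, 0])
beta_hat = np.array([2.0, 0.0, 1.0, 1.0, 0, 0, 0, 0, 0, 0])
g = g_measure(beta_hat, beta_true)                              # sqrt(0.5 * 0.75)
rpe = relative_prediction_error(beta_hat, beta_true, np.eye(10), 1.0)
print(g, rpe)
```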



We first present variable selection results using best-possible selection of tuning
parameters, where tuning parameters are selected ex post facto based on the
best G. This procedure is useful in examining variable selection performances,
free from both inherent variabilities in estimating the tuning parameters and
possible differences in the validation procedures used. Moreover, it indicates
the potential of each method examined. We present
median G out of 200 replications using best-possible selection of tuning parameters.
Standard errors based on 500 bootstrapped re-samplings are very small, in
the hundredth decimal place, for median G and are not shown.
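
A bootstrap standard error for a median, as referred to above, can be sketched as follows. This is one plausible reading, assuming the 200 replicate G values are resampled with replacement 500 times; the function name and the synthetic values are illustrative only.

```python
import numpy as np

def bootstrap_se_of_median(values, n_boot=500, seed=0):
    """Standard deviation of the median over bootstrap re-samples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    medians = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(values, size=values.size, replace=True)
        medians[b] = np.median(resample)
    return medians.std(ddof=1)

# Synthetic stand-in for 200 replicate G values (not data from the paper).
g_values = np.random.default_rng(1).uniform(0.6, 0.8, size=200)
se = bootstrap_se_of_median(g_values)
print(se)
```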
Figure 3: Variable selection performances using best-possible selection of tuning parameters
based on 200 replications at n = {5, 10, 15, 20, 25, 30, 40, 50, 60, 80, 100}.
Panels: (a) Example 1; (b) Example 2.



Results from best-possible selection of tuning parameters allow us to un- 
derstand the potential advantages of the methods if one chooses their tuning 
parameters correctly. However, in practice, possible errors due to the selection
of tuning parameters may sometimes outweigh the benefit of introducing them.
Hence, we include additional results that use cross-validation to select tuning 
parameters. 

We study variable selection methods using a novel variant of the usual
cross-validation to estimate the model complexity parameter $\lambda_n$ that allows improved
variable selection when $p \gg n$. Conventional cross-validation selects tuning
parameters based upon the minimum validation error, obtained from the average
of sum-of-squares errors from each fold. It is well known that, when the sample
size $n$ is large compared with the number of predictors $p$, procedures such as
cross-validation that are prediction-based tend to over-select. This is because,
when the sample size is large, regression methods tend to produce small but
non-zero estimates for coefficients of irrelevant variables and over-training occurs.
On the other hand, we note that a different scenario occurs when $p \gg n$.
In this situation, prediction-based procedures, such as the usual cross-validation,
tend to under-select important variables. This is because, when $n$ is small,
inclusion of relatively few irrelevant variables can increase the validation error
dramatically, resulting in severe instability and under-representation of important
variables. In this paper, we propose to use a variant of the usual
cross-validation, in which we include additional variables by decreasing $\lambda_n$ for up to
1 standard deviation of the validation error at the minimum. Through extensive
empirical studies, we found that this strategy often works well to prevent
under-selection when $n/\sqrt{p} < 5$, which corresponds to $n < 50$ when $p = 100$
and $n < 224$ when $p = 2000$. For $n/\sqrt{p} > 5$ and sample size $n$ only moderately
large, we use the usual cross-validation at the minimum. We note that
Hastie, Tibshirani, and Friedman (2001, p. 216) have described a related strategy
that discards variables up to 1 standard deviation of the minimum
cross-validation error for use when $n$ is large relative to $p$ and over-selection is severe.
In Tables 1-3, we present median RPE, number of true and false positives,
sensitivity, specificity, and G out of 200 replications using modified cross-validation for
selecting tuning parameters. The smallest 3 values of median RPE and largest
3 of median G are highlighted in bold. Standard errors based on 500 bootstrapped
re-samplings are further reported in parentheses for median RPE and
G. In Table 4, we provide an additional simulation study to illustrate the modified
cross-validation.
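
The modified cross-validation rule described above can be sketched as follows. This is an interpretation under stated assumptions, not the authors' code: "decreasing $\lambda_n$ for up to 1 standard deviation" is read as taking the smallest $\lambda$ at or below the minimizer whose mean validation error stays within one standard deviation of the minimum, and the boundary case $n/\sqrt{p} = 5$ (not specified in the text) is assigned to the usual rule. The function name and the toy error curve are hypothetical.

```python
import math
import numpy as np

def select_lambda(lambdas, cv_mean, cv_sd, n, p):
    """Return lambda chosen by the usual rule or the 'include more' variant."""
    lambdas = np.asarray(lambdas, dtype=float)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_sd = np.asarray(cv_sd, dtype=float)
    i_min = int(np.argmin(cv_mean))
    if n / math.sqrt(p) >= 5:
        # Usual cross-validation: minimizer of the mean validation error.
        return float(lambdas[i_min])
    # Variant: allow a smaller lambda (more variables) within 1 sd of the minimum.
    bound = cv_mean[i_min] + cv_sd[i_min]
    eligible = (lambdas <= lambdas[i_min]) & (cv_mean <= bound)
    return float(lambdas[eligible].min())

lambdas = [1.0, 0.5, 0.25, 0.1]
cv_mean = [10.0, 8.0, 8.5, 12.0]    # toy validation errors, minimum at 0.5
cv_sd = [1.0, 1.0, 1.0, 1.0]
small_n = select_lambda(lambdas, cv_mean, cv_sd, n=20, p=100)   # n/sqrt(p) = 2
large_n = select_lambda(lambdas, cv_mean, cv_sd, n=80, p=100)   # n/sqrt(p) = 8
print(small_n, large_n)
```

With the toy curve, the small-sample case moves from $\lambda = 0.5$ to $\lambda = 0.25$, while the larger sample keeps the minimizer.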

In each example, we simulate 200 data sets from the true model, $y = X\beta^* + \sigma\epsilon$,
where $\epsilon \sim N(0, I)$. $X$ is generated each time from $N(0, \Sigma)$, and we vary $\Sigma$, $\beta^*$,
and $\sigma$ in each example to illustrate performances across a variety of situations.
We choose the tuning parameter $\gamma$ from {0, 0.5, 1, 2} for both adaptive lasso
(Zou 2006) and covariance-thresholded lasso with adaptive thresholding. The
adaptive lasso seeks to improve upon the lasso by applying the weights $1/|\hat{\beta}_0|^{\gamma}$,
where $\hat{\beta}_0$ is an initial estimate, in order to penalize each coefficient differently
in the $L_1$-norm of the lasso. The larger $\gamma$ is, the less shrinkage is applied to
coefficients of large magnitudes. The candidate values used for $\gamma$ are suggested
in Zou (2006) and found to work well in practice.



Example 1. (Autocorrelated.) This example has $p = 100$ predictors
with coefficients $\beta^*_j = 3$ for $j \in \{1, \ldots, 5\}$, $\beta^*_j = 1.5$ for $j \in \{11, \ldots, 15\}$, and
$\beta^*_j = 0$ otherwise. $\Sigma_{ij} = 0.5^{|i-j|}$ for all $i$ and $j$, and $\sigma = 9$. Signal-to-noise ratio
(SNR) $\beta^{*T}\Sigma\beta^*/\sigma^2$ is approximately 1.55. This example, similar to Example 1 in
Tibshirani (1996), has an approximately sparse covariance structure, as elements
away from the diagonal can be extremely small.
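
As a check on the setup of Example 1, the sketch below builds the covariance matrix and coefficient vector and evaluates the SNR. It assumes the autoregressive form $\Sigma_{ij} = 0.5^{|i-j|}$, which is the value that reproduces the stated SNR of about 1.55; the variable names, the seed, and the sample size $n = 20$ used for the draw are illustrative.

```python
import numpy as np

p = 100
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = 0.5^{|i-j|}

beta = np.zeros(p)
beta[0:5] = 3.0        # j = 1, ..., 5
beta[10:15] = 1.5      # j = 11, ..., 15
sigma = 9.0

snr = beta @ Sigma @ beta / sigma**2
print(round(snr, 2))   # 1.55

# One simulated data set from y = X beta* + sigma * eps (n = 20 chosen for illustration).
rng = np.random.default_rng(0)
n = 20
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta + sigma * rng.standard_normal(n)
```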

Figure 3(a) depicts variable selection results using best-possible selection
of tuning parameters. We see that the covariance-thresholded lasso methods
dominate the lasso, adaptive lasso, and UST in terms of variable selection for
$p > n$. The performances of lasso and adaptive lasso deteriorate precipitously
as $n$ becomes small, whereas those of the covariance-thresholded lasso methods
decrease at a relatively slow pace. Furthermore, the covariance-thresholded lasso
methods dominate the elastic net and scout for $n$ small. We also observe that the
scout procedures and elastic net perform very similarly. This is not surprising,
as Witten and Tibshirani (2009) have shown in Section 2.5.1 of their paper that
scout(2,1), by regularizing the inverse covariance matrix, is very similar to the
elastic net.

Table 1: Example 1 performance results using fivefold cross-validation based on 200 replications.

0.63    0.683 (0.010)
CT-Lasso hard    1.221 (0.072)
0.704 (0.008)

Results from best-possible selection provide information on the potentials of
the methods examined. In Table 1, we present results using cross-validation to
illustrate performances in practice. The covariance-thresholded lasso methods
tend to dominate the lasso, adaptive lasso, scout, and elastic net in terms of
variable selection for $n$ small. The UST presents good variable selection
performances but large prediction errors. We note that, due to its large bias, the
UST cannot be legitimately applied with cross-validation that uses validation
error to select tuning parameters, especially when the coefficients are disparate
and some correlations are large. The advantages of covariance-thresholded lasso
with hard thresholding are less apparent compared with those of soft and adaptive
thresholding. This suggests that continuous thresholding of covariances may
achieve better performances than discontinuous ones using cross-validation. We
note that the scout procedures perform surprisingly poorly compared with the
covariance-thresholded lasso and the elastic net in terms of variable selection
when $n$ is small. As the scout and elastic net are quite similar in terms of their
potentials for variable selection as shown in Figure 3(a), the differences seem to
come from the additional re-scaling step of the scout, where the scout re-scales
its initial estimates by multiplying them with a scalar $c = \arg\min_c \|y - cX\hat{\beta}\|^2$.
This strategy can sometimes be useful in improving prediction accuracy. However,
when $n$ is small compared with $p$, standard deviations of validation errors
for the scout can often be large, which may cause variable selection performances
to suffer for cross-validation. We additionally note that, when $p \gg n$ and SNR
Table 2: Example 2 performance results using fivefold cross-validation based on 200 replications.

Method            RPE             TP     FP     sens   spec   G
                                                              0.440 (0.008)
Adapt Lasso       0.122 (0.004)   4.0    14.0   0.20   0.83   0.423 (0.009)
Elastic net       0.089 (0.006)   14.0   48.5   0.70   0.39   0.461 (0.012)
UST               0.042 (0.003)   18.0   66.0   0.90   0.18   0.393 (0.006)
Scout(1,1)        NA              NA     NA     NA     NA     NA
Scout(2,1)        0.038 (0.002)   20.0   80.0   1.00   0.00   0.000 (0.000)
CT-Lasso hard     0.159 (0.007)   6.0    17.5   0.30   0.78   0.468 (0.008)
CT-Lasso soft     0.107 (0.009)   9.0    27.0   0.45   0.66   0.521 (0.007)
CT-Lasso adapt    0.129 (0.014)   8.0    24.0   0.40   0.70   0.503 (0.007)

is low as in this example, high specificity can sometimes be more important for
prediction accuracy than high sensitivity. This is because, when $n$ is small,
coefficients of irrelevant variables can be given large estimates, and inclusion of even
a few irrelevant variables can significantly deteriorate prediction accuracy. In
Table 1, we see that the lasso and adaptive lasso have good prediction accuracy
for $n = 20$ though they select fewer than half of the true variables.

Example 2. (Constant covariance.) This example has $p = 100$ predictors
with $\beta^*_j = 3$ for $j \in \{11, \ldots, 20\}$, $\beta^*_j = 1.5$ for $j \in \{31, \ldots, 40\}$, and $\beta^*_j = 0$
otherwise. $\Sigma_{ij} = 0.95$ for all $i$ and $j$ such that $i \neq j$, and $\sigma = 15$. SNR is
approximately 8.58. This example, derived from Example 4 in Tibshirani (1996),
presents an extreme situation where all non-diagonal elements of the population
covariance matrix are nonzero and constant.
In Figure 3(b), we see that the covariance-thresholded lasso methods dominate
over the lasso and adaptive lasso, especially for $n$ small. This example
shows that sparse covariance thresholding may still improve variable selection
when the underlying covariance matrix is non-sparse. Furthermore,
covariance-thresholded lasso methods with soft and adaptive thresholding perform better
than that with hard thresholding. Interestingly, we see that the performance of
UST decreases with increasing $n$ and drops below that of the lasso for $n > 30$.
This example demonstrates that the UST may not be a good general procedure
for variable selection and can sometimes fail unexpectedly. We note that this is
a challenging example for variable selection in general. By the irrepresentable
condition (Zhao and Yu 2006), the lasso is not variable selection consistent under
this scenario. The median G values in Figure 3(b) usually increase much more slowly
with increasing $n$ in comparison with those of Example 1 in Figure 3(a), even
though SNR is higher.

Table 2 shows that the covariance-thresholded lasso methods and the elastic
net dominate over the lasso and adaptive lasso in terms of variable selection when
using cross-validation to select tuning parameters. The UST under-performs the
lasso and adaptive lasso in terms of variable selection. Scout(2,1) does the worst
in terms of variable selection by including all variables but presents the best
prediction error. Again, we note that this may be due to the re-scaling step
employed by the scout, which may sometimes improve performance in prediction
but often suffers in terms of variable selection, especially when the sample size
is small.

Example 3. (Grouped variables.) This example has $p = 100$ predictors
with $\beta^* = (3, 3, 2.5, 2.5, 2, 2, 1.5, 1.5, 1, 1, 0, \ldots, 0)$. The predictors are generated
as $X_j = Z_1 + \sqrt{17/3}\,\epsilon_{x,j}$ for $j \in \{1, \ldots, 10\}$, $X_j = Z_2 + \sqrt{1/19}\,\epsilon_{x,j}$ for
$j \in \{11, \ldots, 15\}$, and $X_j = \epsilon_{x,j}$ otherwise, where $Z_1 \sim N(0, 1)$, $Z_2 \sim N(0, 1)$, and
$\epsilon_{x,j} \sim N(0, 1)$ are independent. This creates within-group correlations of
$\Sigma_{ij} = 0.15$ for $i, j \in \{1, \ldots, 10\}$ and $\Sigma_{ij} = 0.95$ for $i, j \in \{11, \ldots, 15\}$. $\sigma = 15$ and
SNR is approximately 1.1. This example presents an interesting scenario where
a group of significant variables are mildly correlated and simultaneously a group
of insignificant variables are strongly correlated.
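
The within-group correlations in Example 3 follow from the shared factor: if $X_j = Z + c\,\epsilon_{x,j}$ with independent standard normal $Z$ and $\epsilon_{x,j}$, then $\mathrm{Cov}(X_i, X_j) = 1$ and $\mathrm{Var}(X_j) = 1 + c^2$, so the correlation is $1/(1 + c^2)$. The sketch below checks this, assuming noise variances $c^2 = 17/3$ and $c^2 = 1/19$, which are the values that reproduce the stated correlations of 0.15 and 0.95; the function names and seed are illustrative.

```python
import numpy as np

def within_group_corr(noise_var):
    """Correlation of X_i and X_j when X = Z + sqrt(noise_var) * eps, Var(Z) = 1."""
    return 1.0 / (1.0 + noise_var)

corr_group1 = within_group_corr(17.0 / 3.0)   # predictors 1, ..., 10
corr_group2 = within_group_corr(1.0 / 19.0)   # predictors 11, ..., 15
print(round(corr_group1, 2), round(corr_group2, 2))   # 0.15 0.95

def simulate_predictors(n, p=100, seed=0):
    """Draw an n x p design following the grouped construction above."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n, p))
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    X = eps.copy()
    X[:, 0:10] = z1[:, None] + np.sqrt(17.0 / 3.0) * eps[:, 0:10]
    X[:, 10:15] = z2[:, None] + np.sqrt(1.0 / 19.0) * eps[:, 10:15]
    return X

X = simulate_predictors(n=40)
```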

In Figure 3(c), we see that the covariance-thresholded lasso dominates generally
in terms of variable selection. Similarly, Table 3 shows that the
covariance-thresholded lasso does relatively well compared with other methods when using
cross-validation to select tuning parameters. Further, the elastic net tends to
have lower specificities than the covariance-thresholded lasso methods. In the
related scenario of Example 4 in Zou and Hastie (2005), where a group of significant
variables has strong within-group correlation and is independent otherwise,
the performances of elastic net are similar to those of covariance-thresholded
lasso using soft thresholding, as both methods regularize covariances with large
magnitudes.

Table 3: Example 3 performance results using fivefold cross-validation based on 200
replications.

n    Method            RPE             TP     FP     sens   spec   G
20   Lasso             ?               5.0    12.0   0.50   0.87   0.650 (0.003)
     Adapt Lasso       ?               ?      ?      ?      ?      ?
     UST               ?               ?      ?      ?      ?      0.719 (0.007)
     Scout(1,1)        ?               10.0   90.0   1.00   0.00   0.000 (0.000)
     Scout(2,1)        1.274 (0.081)   8.0    37.5   0.80   0.58   0.536 (0.017)
     Elastic net       1.891 (0.203)   9.0    32.0   0.90   0.64   0.660 (0.015)
     CT-Lasso hard     1.542 (0.105)   6.5    20.5   0.70   0.77   0.636 (0.018)
     CT-Lasso soft     1.240 (0.058)   7.0    22.5   0.70   0.77   0.700 (0.013)
     CT-Lasso adapt    1.427 (0.069)   7.0    19.0   0.70   0.76   0.665 (0.015)
40   Lasso             0.729 (0.044)   8.0    27.0   0.80   0.70   0.748 (0.008)
     Adapt Lasso       ?               8.0    21.5   0.80   0.76   0.789 (0.008)
     UST               1.784 (?)       10.0   33.0   1.00   0.63   0.789 (0.010)
     Scout(1,1)        ?               10.0   68.5   1.00   0.24   0.475 (0.103)
     Scout(2,1)        0.628 (0.037)   10.0   54.5   1.00   0.39   0.616 (0.028)
     Elastic net       1.101 (0.057)   10.0   35.0   1.00   0.62   0.748 (0.012)
     CT-Lasso hard     0.808 (0.045)   9.0    21.0   0.90   0.76   0.806 (0.012)
     CT-Lasso soft     0.723 (0.026)   9.0    22.0   0.90   0.76   0.815 (0.007)
     CT-Lasso adapt    0.760 (0.046)   9.0    22.0   0.90   0.76   0.819 (0.009)
80   Lasso             0.221 (0.013)   9.0    24.0   0.90   0.73   0.825 (0.007)
     Adapt Lasso       0.222 (0.017)   10.0   19.0   1.00   0.79   0.864 (0.006)
     UST               0.104 (0.008)   10.0   6.0    1.00   0.93   0.946 (0.004)
     Scout(1,1)        0.070 (0.005)   10.0   5.5    1.00   0.94   0.937 (0.004)
     Scout(2,1)        0.070 (0.003)   10.0   7.0    1.00   0.92   0.938 (0.004)
     Elastic net       0.104 (0.009)   10.0   9.0    1.00   0.90   0.937 (0.005)
     CT-Lasso hard     0.069 (0.005)   9.0    3.0    0.90   0.97   0.938 (0.004)
     CT-Lasso soft     0.063 (0.005)   10.0   4.0    1.00   0.94   0.938 (0.003)
     CT-Lasso adapt    0.063 (0.004)   10.0   4.0    1.00   0.96   0.943 (0.003)

Methods of Cross-Validation. We examine the modified cross-validation
presented in the beginning of this section. In Table 4, we summarize results of 3
variants of cross-validation from covariance-thresholded lasso with soft thresholding.
Cross-validation by including additional variables up to 1 standard deviation
of the minimum ($CV_-$), cross-validation by minimum validation error ($CV_0$), and
cross-validation by discarding variables up to 1 standard deviation of the minimum
($CV_+$) are presented. The largest G value and smallest bootstrapped standard
deviations of G among cross-validation methods are highlighted in boldface.

Table 4: Performances of cross-validation methods based upon 200 replications of
covariance-thresholded lasso with soft thresholding. $CV_-$ includes additional variables
up to 1 standard deviation of the minimum cross-validation error; $CV_0$ selects $\lambda_n$ at the
minimum; and $CV_+$ discards variables up to 1 standard deviation of the minimum.

               CV-                          CV0                          CV+
Example  n     sens  spec  G                sens  spec  G                sens  spec  G
Ex 1     20    0.60  0.78  0.667 (0.007)    0.50  0.92  0.607 (0.009)    0.00  1.00  0.000 (0.086)
         40    0.80  0.77  0.739 (0.007)    0.60  0.92  0.699 (0.011)    0.30  1.00  0.548 (0.019)
         60    0.80  0.73  0.756 (0.013)    0.70  0.93  0.776 (0.013)    0.40  1.00  0.632 (0.016)
         80    0.90  0.73  0.789 (0.011)    0.80  0.94  0.827 (0.008)    0.50  1.00  0.707 (0.002)
Ex 2     20    0.33  0.71  0.465 (0.008)    0.23  0.83  0.409 (0.012)    0.15  0.89  0.362 (0.009)
         40    0.50  0.56  0.485 (0.006)    0.30  0.79  0.463 (0.009)    0.20  0.86  0.414 (0.014)
         60    0.55  0.54  0.504 (0.009)    0.35  0.71  0.497 (0.007)    0.30  0.79  0.473 (0.011)
         80    0.65  0.48  0.505 (0.009)    0.45  0.66  0.521 (0.007)    0.35  0.74  0.498 (0.008)
Ex 3     20    0.70  0.77  0.700 (0.013)    0.60  0.90  0.671 (0.015)    0.20  0.99  0.433 (0.117)
         40    0.90  0.76  0.815 (0.007)    0.80  0.91  0.846 (0.011)    0.60  0.99  0.762 (0.025)
         60    1.00  0.79  0.865 (0.006)    0.90  0.93  0.922 (0.004)    0.80  0.99  0.872 (0.018)
         80    1.00  0.80  0.882 (0.007)    1.00  0.94  0.938 (0.003)    0.80  1.00  0.894 (0.002)

The results demonstrate the overwhelming pattern that the proportion of
relevant variables selected, or sensitivity, decreases as $n$ decreases under cross-validation.
We note that $CV_+$, as recommended in Hastie, Tibshirani, and Friedman (2001),
does not work well in general for $n < p$. For $n$ very small, $CV_0$ often selects too
few variables, whereas, for $n$ relatively large, $CV_-$ usually includes too many
irrelevant variables. Moreover, when $n$ is very small, bootstrapped standard
deviations of G are usually the smallest for $CV_-$, whereas, when $n$ is relatively
large, $CV_0$ usually yields better standard deviations of G. These observations
suggest the modified cross-validation that employs $CV_0$ when $n/\sqrt{p} > 5$ and
$CV_-$ when $n/\sqrt{p} < 5$.

5. Real Data 

In this section, we compare the performance of covariance-thresholded lasso
with those of lasso, adaptive lasso, UST, scout(1,1), scout(2,1), and elastic net.
We apply the methods to 3 well-known data sets. For each data set, we randomly
partition the data into a training and a testing set. Tuning parameters are estimated
using fivefold cross-validation on the training set, and performances are
measured with the testing set. When $n/\sqrt{p} < 5$, the modified cross-validation
described in Section 4 is used, where additional variables are included up to 1 standard
deviation of the validation error at the minimum. In order to avoid inconsistency
of results due to randomization (Bøvelstad, Nygård, Størvold, Aldrin, Borgan, Frigessi, and Lingjærde 20…),
we repeat the comparisons 100 times, each with a different random partition of
the training and testing set. In Table 5, we report median test MSE or
classification error and number of variables selected. The smallest 3 test MSEs or
classification errors are highlighted in boldface. In addition, standard errors
based on 500 bootstrapped re-samplings are reported in parentheses.

Highway data. Consider the highway accident data from an unpublished
master's paper by C. Hoffstedt and examined in Weisberg (1980). The
data set contains 39 observations, which we divide randomly into $n = 28$ and
$n_{Test} = 11$ observations for the training and testing set, respectively. The
response is $y$ = accident rate per million vehicle miles. There are originally 9
predictors, and we further include quadratic and interaction terms to obtain a
total of $p = 54$ predictors. The original predictors are $X_1$ = length of highway
segment, $X_2$ = average daily traffic count, $X_3$ = truck volume as a percentage of
the total volume, $X_4$ = speed limit, $X_5$ = width of outer shoulder, $X_6$ = number of
freeway-type interchanges per mile, $X_7$ = number of signalized interchanges per
mile, $X_8$ = number of access points per mile, and $X_9$ = total number of lanes of
traffic in both directions.

Table 5 summarizes the results obtained. Covariance-thresholded lasso methods
with hard, soft, and adaptive thresholding outperform the elastic net with
12%, 44%, and 49% reductions in median tMSE, respectively, and the lasso with
21%, 49%, and 54% reductions in median tMSE, respectively. The scout has the
smallest tMSE. We note that this may be due to scout's additional re-scaling step,
in which it multiplies its initial estimates by a scalar $c = \arg\min_c \|y - cX\hat{\beta}\|^2$,
as explained in Section 4.
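
The percentage reductions quoted above can be reproduced from the Highway column of Table 5. The short sketch below does the arithmetic; the dictionary and function names are illustrative, and the tMSE values are copied from the table.

```python
# Median test MSE on the Highway data, as reported in Table 5.
tmse = {
    "Lasso": 6.836,
    "Elastic net": 6.165,
    "CT-Lasso hard": 5.400,
    "CT-Lasso soft": 3.480,
    "CT-Lasso adapt": 3.170,
}

def pct_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

ct_methods = ["CT-Lasso hard", "CT-Lasso soft", "CT-Lasso adapt"]
vs_elastic_net = [round(pct_reduction(tmse["Elastic net"], tmse[m])) for m in ct_methods]
vs_lasso = [round(pct_reduction(tmse["Lasso"], tmse[m])) for m in ct_methods]
print(vs_elastic_net)   # [12, 44, 49]
print(vs_lasso)         # [21, 49, 54]
```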

CDI data. Next, we consider the county demographic information (CDI)
data from the Geospatial and Statistical Data Center of the University of Virginia
and examined in Kutner, Nachtsheim, Neter, and Li (2005). The data set
contains 440 observations, which we divide randomly into $n = 308$ and $n_{Test} = 132$
observations for the training and testing set, respectively. The response is
$y$ = total number of crimes. There are originally 12 predictors, and we further include
quadratic and interaction terms to obtain a total of $p = 90$ predictors. The
original predictors are $X_1$ = land area, $X_2$ = population, $X_3$ = percent 18-24 years
old, $X_4$ = percent 65 years old or older, $X_5$ = number of active nonfederal physicians,
$X_6$ = number of hospital beds, $X_7$ = percent of adults graduated from high
school, $X_8$ = percent of adults with bachelor's degree, $X_9$ = percent below poverty
level income, $X_{10}$ = percent of labor force unemployed, $X_{11}$ = per capita income,
and $X_{12}$ = total personal income.

Table 5: Highway ($n$=28, $n_{Test}$=11, $p$=54), CDI ($n$=308, $n_{Test}$=132, $p$=90), and
Golub microarray ($n$=38, $n_{Test}$=34, $p$=1,000) data performance results based on 100
random partitions of training and testing sets.

                  Highway                        CDI                            Golub Microarray
Method            tMSE             no.           tMSE/10^10      no.            test error    no.
Lasso             6.836 (0.917)    24 (0.5)      0.925 (0.225)   82 (1.8)       3.0 (0.383)   37 (0.0)
Adapt Lasso       6.246 (0.577)    22 (0.4)      0.701 (0.263)   67.5 (4.0)     3.0 (0.401)   37 (0.0)
UST               12.948 (1.138)   24 (0.9)      1.562 (0.115)   20 (0.3)       2.0 (0.447)   198 (0.0)
Scout(1,1)        3.121 (0.172)    20.5 (2.0)    NA              NA             NA            NA
Scout(2,1)        2.372 (0.292)    17.5 (1.9)    0.201 (0.014)   6 (1.1)        1.0 (0.472)   194 (2.2)
Elastic net       6.165 (0.549)    31 (2.2)      0.216 (0.013)   22.5 (1.1)     3.0 (0.336)   26.5 (5.9)
CT-Lasso hard     5.400 (0.481)    25 (1.4)      0.226 (0.021)   35.5 (4.7)     3.0 (0.388)   21 (3.7)
CT-Lasso soft     3.480 (0.268)    21 (2.5)      0.185 (0.010)   21 (2.7)       3.0 (0.388)   24.5 (3.4)
CT-Lasso adapt    3.170 (0.486)    19.5 (1.3)    0.209 (0.015)   26 (2.4)       2.5 (0.476)   36 (0.0)

Table 5 shows that the scout, elastic net, and covariance-thresholded lasso
dominate the lasso and adaptive lasso in terms of prediction accuracy.
Covariance-thresholded lasso with soft thresholding performs the best with 80% reduction
in median tMSE from that of the lasso. Adaptive lasso methods with relatively
large bootstrapped standard errors perform comparably to the lasso.



Microarray data. Finally, we consider the microarray data from Golub, Slonim, Tamayo, Huard, Gaasenb…
This example seeks to distinguish acute leukemias arising from lymphoid precursors
(ALL) and myeloid precursors (AML). The data set contains 72 observations,
which we divide randomly into $n = 38$ and $n_{Test} = 34$ observations for
the training and testing set, respectively. For the response $y$, we assign values
of 1 and -1 to ALL and AML, respectively. A classification rule is applied for
the fitted response such that ALL is represented if $\hat{y} > 0$ and AML otherwise.
There are originally 7,129 predictors from Affymetrix arrays. We use sure
independence screening (SIS) with componentwise regression, as recommended in
Fan and Lv (2008), to first select $p = 1,000$ candidate genes. An early stop strategy
is applied for all methods at the 200th step, and cross-validation is performed
using the number of steps.
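
The screening step can be sketched as follows. This is a minimal illustration of componentwise-regression screening in the spirit of SIS, not the authors' pipeline: it assumes predictors are standardized, ranks them by the absolute marginal inner product with the centered response, and keeps the top $d$. The function name `sis_screen` and the synthetic data are hypothetical.

```python
import numpy as np

def sis_screen(X, y, d):
    """Return the (sorted) indices of the d predictors with largest |marginal coefficient|."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    yc = y - y.mean()                           # center the response
    omega = Xs.T @ yc                           # componentwise regression, up to a constant factor
    order = np.argsort(-np.abs(omega))          # largest magnitude first
    return np.sort(order[:d])

# Synthetic illustration: 2 relevant predictors among 500, labels in {-1, 1} are not used here.
rng = np.random.default_rng(1)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = 5.0 * X[:, 3] - 4.0 * X[:, 7] + rng.standard_normal(n)
kept = sis_screen(X, y, d=20)
print(kept)
```

In the setting described above, `d` would be 1,000 and `X` would hold the 7,129 gene expression columns of the training set; columns with zero variance would need to be removed before standardizing.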

Table 5 presents results in terms of test errors, that is, the numbers of 
misclassifications out of the 34 test samples. We note that the performances of 
the covariance-thresholded lasso methods are comparable with those of the lasso, 
adaptive lasso, and elastic net in terms of prediction accuracy. However, the 
covariance-thresholded lasso methods with hard and soft thresholding select 
comparatively fewer variables than the lasso, adaptive lasso, and elastic net, 
whereas the scout severely over-selects, with the number of variables selected 
close to the maximum of 200 imposed by early stopping. Given comparable 
prediction accuracy, this may suggest that the covariance-thresholded lasso can 
more readily differentiate between true and irrelevant variables under 
high-dimensionality. 
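The test error reported here is a count of misclassified samples under the sign rule for the fitted response. A minimal sketch of that bookkeeping follows; the fitted values and labels are made up for illustration, and the helper names are hypothetical.

```python
def classify(fitted):
    """Sign rule: a positive fitted response is labeled ALL, otherwise AML."""
    return ["ALL" if v > 0 else "AML" for v in fitted]


def misclassifications(fitted, labels):
    """Number of samples whose predicted label differs from the true label."""
    return sum(1 for pred, truth in zip(classify(fitted), labels) if pred != truth)


# Illustrative values only (not data from the paper).
fitted = [0.8, -0.3, 0.1, -1.2, 0.4]
labels = ["ALL", "AML", "AML", "AML", "ALL"]
errors = misclassifications(fitted, labels)
```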

6. Conclusion and Further Discussions 

In this paper, we have proposed the covariance-thresholded lasso, a new 
regression method that stabilizes and improves the lasso for variable selection 
by utilizing covariance sparsity, which is a ubiquitous property in 
high-dimensional applications. The method represents an important marriage 
between methods of covariance regularization (Bickel and Levina 2008a) and 
variable selection. We have presented theoretical results together with 
simulation and real-data examples to indicate that our method can be useful in 
improving variable selection performance, especially when n < p. 

Furthermore, we note that there are many other variable selection procedures, 
such as the relaxed lasso (Meinshausen 2007), VISA (Radchenko and James 2008), 
etc., that may well be considered for comparison in Section 4 for the n < p 
scenario. However, due to limited space, we restrict ourselves to only closely 
related methods in this paper. We believe it would be interesting to further 
explore other methods for the n < p scenario using modified cross-validation and 
best-possible selection of tuning parameters, and we hope to include them in 
future work. 

Finally, sparse covariance-thresholding is a general procedure for variable 
selection in high-dimensional applications. In this paper, we applied 
covariance-thresholding specifically to the lasso. Nonetheless, a myriad of 
variable selection methods, such as the Dantzig selector (Candes and Tao 2007), 
SIS (Fan and Lv 2008), etc., can also benefit by utilizing 
covariance-thresholding to improve variable selection. We believe that results 
established in this paper will also be useful in applying sparse 
covariance-thresholding to variable selection methods other than the lasso. 

7. Appendix 

In this appendix, we first state and prove some preliminary lemmas that will 
be used in later proofs. Lemma 7.2 gives upper bounds on the errors of 
$\hat\Sigma^{\nu}_{CS}$ and $\hat\Sigma^{\nu}_{SS}$ as estimates of 
$\Sigma_{CS}$ and $\Sigma_{SS}$, respectively. Lemma 7.3 gives the upper bound 
of any sample covariance matrix as an estimate of its population counterpart. 
The rest of the appendix is dedicated to the proofs of results in Section 3.1. 
The proofs of results in Section 3.2, which we omit, are similar to those in 
Section 3.1, except that $\nu$ is set to be 0 and Lemma 7.3 is used in place of 
Lemma 7.2. 

7.1. Preliminary Lemmas 

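Lemma 7.1 Suppose $(X_{k1}, X_{k2}, \ldots, X_{kp})$, $1 \le k \le n$, are 
independent and identically distributed random vectors with $E(X_{kj}) = 0$, 
$E(X_{ki}X_{kj}) = \sigma_{ij}$, and $E X_{kj}^{2d} \le d!\,M^{d}$ for 
$d \in \mathbb{N} \cup \{0\}$, $M > 0$ and $1 \le i, j \le p$. Let 
$\hat\sigma_{ij} = \frac{1}{n}\sum_{k=1}^{n} X_{ki}X_{kj}$. Then, for 
$t_n = o(1)$, 

$$P\left(|\hat\sigma_{ij} - \sigma_{ij}| > t_n\right) \le \exp(-c n t_n^2), \tag{7.30}$$

where $c$ is some constant depending only on $M$. 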
Proof of Lemma 7.1 

Let $Z_k = X_{ki}X_{kj} - \sigma_{ij}$. We apply Bernstein's inequality (moment 
version) (see, for example, van der Vaart and Wellner (1996)) to the series 
$\sum_{k=1}^{n} Z_k$. For $m \ge 1$, we have 

$$E|Z_k|^m = E|X_{ki}X_{kj} - \sigma_{ij}|^m \le \sum_{d=0}^{m} \binom{m}{d} |\sigma_{ij}|^{m-d}\, E|X_{ki}X_{kj}|^{d}.$$

By the moment conditions in Lemma 7.1, we have $|\sigma_{ij}| \le M$ and 
$E|X_{ki}X_{kj}|^{d} \le \frac{1}{2}\left(E X_{ki}^{2d} + E X_{kj}^{2d}\right) \le d!\,M^{d}$. 
Therefore, $E|Z_k|^m \le m!\,M^{m}\sum_{d=0}^{m}\binom{m}{d} = m!\,(2M)^{m}$, 
and the result follows by applying the moment version of Bernstein's inequality. 

□ 
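The last step of the proof can be spelled out as a worked sketch. The inequality below is the moment version of Bernstein's inequality in the form it is commonly stated; the symbols $K$, $v_k$, $v$, and $x$ are notation introduced here for illustration and do not appear in the original argument.

```latex
% If Z_1, ..., Z_n are independent with mean zero and
% E|Z_k|^m <= m! K^{m-2} v_k / 2 for all m >= 2, then for v >= v_1 + ... + v_n,
\[
P\Big(\Big|\sum_{k=1}^{n} Z_k\Big| > x\Big)
  \le 2\exp\Big(-\frac{x^{2}}{2\,(v + Kx)}\Big).
\]
% From the proof, E|Z_k|^m <= m!(2M)^m = m!(2M)^{m-2}(8M^2)/2,
% so one may take K = 2M, v_k = 8M^2, v = 8M^2 n, and x = n t_n:
\[
P\big(|\hat\sigma_{ij} - \sigma_{ij}| > t_n\big)
  \le 2\exp\Big(-\frac{n\,t_n^{2}}{16M^{2} + 4M\,t_n}\Big).
\]
```

Since $t_n = o(1)$, the denominator is eventually bounded by a constant depending only on $M$, which gives the rate $\exp(-c n t_n^2)$ in (7.30) up to the leading factor 2.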






Lemma 7.2 If $\nu$ is chosen to be greater than 
$C\sqrt{\log(s(p-s))}/\sqrt{n}$ for some $C$ large enough, then 

$$\left\|\hat\Sigma^{\nu}_{CS} - \Sigma_{CS}\right\|_\infty \le O_p\left(\nu d^*_{CS}\right) + O_p\left(d^*_{CS}\sqrt{\log(s(p-s))}/\sqrt{n}\right). \tag{7.31}$$

If $\nu$ is chosen to be greater than $C\sqrt{\log s}/\sqrt{n}$ for some $C$ 
large enough, then 

$$\left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty \le O_p\left(\nu d^*_{SS}\right) + O_p\left(d^*_{SS}\sqrt{\log s}/\sqrt{n}\right). \tag{7.32}$$

The proof is similar to that of Theorem 1 in Bickel and Levina (2008a) and 
Theorem 1 in Rothman et al. (2009), and, thus, it is omitted to save space. The 
detailed proof can be found in the supplementary document. 
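To make the object of Lemma 7.2 concrete, the following is a minimal sketch of hard and soft thresholding applied to a sample covariance matrix. It assumes that only off-diagonal entries are thresholded and that the threshold takes the form $C\sqrt{\log m}/\sqrt{n}$ with a user-supplied count $m$; the function names and the toy matrix are hypothetical.

```python
import math


def hard_threshold(value, nu):
    """Keep an entry only if its magnitude exceeds nu."""
    return value if abs(value) > nu else 0.0


def soft_threshold(value, nu):
    """Shrink an entry toward zero by nu, setting small entries to zero."""
    return math.copysign(max(abs(value) - nu, 0.0), value)


def threshold_covariance(S, nu, rule):
    """Apply `rule` to off-diagonal entries; the diagonal is left untouched
    (an assumption of this sketch)."""
    p = len(S)
    return [[S[i][j] if i == j else rule(S[i][j], nu) for j in range(p)]
            for i in range(p)]


def threshold_level(m, n, C=1.0):
    """nu = C * sqrt(log m) / sqrt(n); m might be s or s*(p - s)."""
    return C * math.sqrt(math.log(m)) / math.sqrt(n)


S = [[1.0, 0.6, 0.1], [0.6, 1.0, -0.25], [0.1, -0.25, 1.0]]
nu = 0.3
S_hard = threshold_covariance(S, nu, hard_threshold)
S_soft = threshold_covariance(S, nu, soft_threshold)
# Largest entrywise change caused by each rule; both stay within nu.
max_dev_hard = max(abs(S_hard[i][j] - S[i][j]) for i in range(3) for j in range(3))
max_dev_soft = max(abs(S_soft[i][j] - S[i][j]) for i in range(3) for j in range(3))
```

Both rules move each entry by at most $\nu$, which is the property that lets the thresholding error be controlled row by row.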

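Lemma 7.3 Let $A$ and $B$ be two arbitrary subsets of $\{1, 2, \ldots, p\}$ and 
let $\hat\Sigma_{AB} = (\hat\sigma_{ij})_{i \in A, j \in B}$ and 
$\Sigma_{AB} = (\sigma_{ij})_{i \in A, j \in B}$. Further, let $a$ be the 
cardinality of $A$ and $b$ the cardinality of $B$. Suppose $a$ and $b$ satisfy 
$\sqrt{\log(ab)}/\sqrt{n} \to 0$ as $n \to \infty$. Then 

$$\left\|\hat\Sigma_{AB} - \Sigma_{AB}\right\|_\infty = O_p\left(b\sqrt{\log(ab)}/\sqrt{n}\right).$$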

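Proof of Lemma 7.3 

Since 

$$P\left(\left\|\hat\Sigma_{AB} - \Sigma_{AB}\right\|_\infty > t\right) \le \sum_{i \in A}\sum_{j \in B} P\left(|\hat\sigma_{ij} - \sigma_{ij}| > t/b\right) \le a \cdot b \cdot \exp(-c n t^2/b^2),$$

for $t/b = o(1)$ by Lemma 7.1, the result follows. 

□ 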
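The first inequality in this proof rests on a simple fact: the matrix $\infty$-norm is the largest absolute row sum, so if it exceeds $t$ then at least one of the $b$ entries in some row exceeds $t/b$. A small numerical illustration follows, using an arbitrary made-up error matrix.

```python
def matrix_inf_norm(M):
    """Matrix infinity norm: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def max_abs_entry(M):
    """Largest absolute entry of the matrix."""
    return max(abs(v) for row in M for v in row)


# Hypothetical error matrix with a = 2 rows and b = 3 columns.
E = [[0.02, -0.05, 0.01], [0.04, 0.03, -0.06]]
b = len(E[0])
norm = matrix_inf_norm(E)
largest = max_abs_entry(E)
# The norm can never exceed b times the largest entry, so
# {norm > t} implies {some entry > t / b}.
bound_holds = norm <= b * largest
```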



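7.2. Proof of Lemma 3.1 

By the KKT conditions, the solution of (2.3) satisfies 
$\hat\Sigma_{\nu}\hat\beta^{\nu} - \frac{1}{n}X^{T}y + \lambda_n z = 0$, where 
$z$ is the sub-gradient of $\|\beta\|_1$ at $\hat\beta^{\nu}$, that is, 
$z = \partial\|\hat\beta^{\nu}\|_1$. Plugging in $y = X\beta^* + \epsilon$, we 
have 

$$\hat\Sigma_{\nu}(\hat\beta^{\nu} - \beta^*) + (\hat\Sigma_{\nu} - \hat\Sigma)\beta^* - \frac{1}{n}X^{T}\epsilon + \lambda_n z = 0. \tag{7.33}$$

It is easy to see that $\mathrm{sgn}(\hat\beta^{\nu}) = \mathrm{sgn}(\beta^*)$ 
holds if $\hat\beta^{\nu}_S \ne 0$, $\hat\beta^{\nu}_C = 0$, 
$z_S = \mathrm{sgn}(\beta^*_S)$, and $|z_C| \le 1$. Therefore, based on (7.33), 
the conditions for $\mathrm{sgn}(\hat\beta^{\nu}) = \mathrm{sgn}(\beta^*)$ to 
hold are 

$$\hat\Sigma^{\nu}_{SS}(\hat\beta^{\nu}_S - \beta^*_S) + (\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S - \frac{1}{n}X_S^{T}\epsilon = -\lambda_n\,\mathrm{sgn}(\beta^*_S), \tag{7.34}$$

$$\mathrm{sgn}(\hat\beta^{\nu}_S) = \mathrm{sgn}(\beta^*_S), \tag{7.35}$$

$$\left\|\hat\Sigma^{\nu}_{CS}(\hat\beta^{\nu}_S - \beta^*_S) + (\hat\Sigma^{\nu}_{CS} - \hat\Sigma_{CS})\beta^*_S - \frac{1}{n}X_C^{T}\epsilon\right\|_\infty \le \lambda_n. \tag{7.36}$$

Solving (7.34) for $\hat\beta^{\nu}_S$ under the assumption 
$\lambda_{\min}(\hat\Sigma^{\nu}_{SS}) > 0$, we have 

$$\hat\beta^{\nu}_S = \beta^*_S + (\hat\Sigma^{\nu}_{SS})^{-1}\left[\frac{1}{n}X_S^{T}\epsilon - \lambda_n\,\mathrm{sgn}(\beta^*_S) - (\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\right]. \tag{7.37}$$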



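Substituting (7.37) into the left-hand side of (7.36) and further decomposing the 
resulting equation, we have 

$$
\begin{aligned}
&\left\|\hat\Sigma^{\nu}_{CS}(\hat\Sigma^{\nu}_{SS})^{-1}\left[\frac{1}{n}X_S^{T}\epsilon - \lambda_n\,\mathrm{sgn}(\beta^*_S) - (\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\right] + (\hat\Sigma^{\nu}_{CS} - \hat\Sigma_{CS})\beta^*_S - \frac{1}{n}X_C^{T}\epsilon\right\|_\infty \\
&\quad \le \left\|\hat\Sigma^{\nu}_{CS}(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty\left(\left\|\frac{1}{n}X_S^{T}\epsilon\right\|_\infty + \left\|(\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\right\|_\infty + \lambda_n\right) + \left\|(\hat\Sigma^{\nu}_{CS} - \hat\Sigma_{CS})\beta^*_S\right\|_\infty + \left\|\frac{1}{n}X_C^{T}\epsilon\right\|_\infty \\
&\quad \le \left\|\hat\Sigma^{\nu}_{CS}(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty\left(\left\|\frac{1}{n}X_S^{T}\epsilon\right\|_\infty + s\nu\bar\rho + \lambda_n\right) + s\nu\bar\rho + \left\|\frac{1}{n}X_C^{T}\epsilon\right\|_\infty,
\end{aligned}
$$

where the last inequality is obtained by 

$$
\begin{aligned}
\left\|(\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\right\|_\infty &\le \left\|\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS}\right\|_\infty \left\|\beta^*_S\right\|_\infty \le s\nu\bar\rho, \\
\left\|(\hat\Sigma^{\nu}_{CS} - \hat\Sigma_{CS})\beta^*_S\right\|_\infty &\le \left\|\hat\Sigma^{\nu}_{CS} - \hat\Sigma_{CS}\right\|_\infty \left\|\beta^*_S\right\|_\infty \le s\nu\bar\rho.
\end{aligned} \tag{7.38}
$$

Then, condition (3.11) is sufficient for (7.36) to hold. 

Next, we derive (3.12). By (7.37), (7.35) is implied by 

$$\left\|(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty\left(\left\|(\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\right\|_\infty + \left\|\frac{1}{n}X_S^{T}\epsilon\right\|_\infty + \lambda_n\right) < \rho. \tag{7.39}$$

Plugging in the upper bound of 
$\|(\hat\Sigma^{\nu}_{SS} - \hat\Sigma_{SS})\beta^*_S\|_\infty$ in (7.38), it is 
straightforward to see that (3.12) is sufficient for (7.35) to hold. □ 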



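7.3. Proof of Lemma 3.2 

For any $v$ with $\|v\|_2 = 1$, 

$$v^{T}\hat\Sigma^{\nu}_{SS}v \ge \lambda_{\min}(\Sigma_{SS}) - \left|v^{T}(\hat\Sigma^{\nu}_{SS} - \Sigma_{SS})v\right| \ge \lambda_{\min}(\Sigma_{SS}) - \left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty,$$

and, when choosing $\nu = C\sqrt{\log s}/\sqrt{n}$ for some $C > 0$, 

$$\left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty \le O_p\left(d^*_{SS}\sqrt{\log s}/\sqrt{n}\right) \tag{7.40}$$

by Lemma 7.2. Therefore, the result follows under the condition (3.15). □ 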



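7.4. Proof of Lemma 3.3 

To derive the upper bound of $\|(\hat\Sigma^{\nu}_{SS})^{-1}\|_\infty$, we 
perform the following decomposition. 

$$\left\|(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty \le \left\|(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right\|_\infty + \bar{D}. \tag{7.41}$$

Because 

$$
\begin{aligned}
\left\|(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right\|_\infty
&\le \left\|(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty \left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty \left\|(\Sigma_{SS})^{-1}\right\|_\infty \\
&\le \bar{D}\left(\left\|(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right\|_\infty + \bar{D}\right)\left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty,
\end{aligned}
$$

where the second inequality is obtained by (7.41), we have 

$$\left\|(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right\|_\infty \le \frac{\bar{D}^{2}\left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty}{1 - \bar{D}\left\|\hat\Sigma^{\nu}_{SS} - \Sigma_{SS}\right\|_\infty} \le O_p\left(\bar{D}^{2} d^*_{SS}\sqrt{\log(p-s)}/\sqrt{n}\right), \tag{7.42}$$

where the last inequality is derived by choosing 
$\nu = C\sqrt{\log(p-s)}/\sqrt{n}$, applying (7.32) in Lemma 7.2 and using the 
condition (3.18). Combining (7.41), (7.42), and condition (3.18), we have 

$$\left\|(\hat\Sigma^{\nu}_{SS})^{-1}\right\|_\infty \le O_p\left(\bar{D}\right). \tag{7.43}$$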
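The algebra behind (7.42) is a standard perturbation bound for matrix inverses: with $D = \|A^{-1}\|_\infty$ and $e = \|\hat{A} - A\|_\infty$, one has $\|\hat{A}^{-1} - A^{-1}\|_\infty \le D^2 e/(1 - De)$ whenever $De < 1$. The following is a small numerical check on an arbitrary $2 \times 2$ example; the matrices are made up and are not taken from the paper.

```python
def inf_norm(M):
    """Matrix infinity norm: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def inverse_2x2(M):
    """Closed-form inverse of a 2 x 2 matrix with nonzero determinant."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def subtract(P, Q):
    """Entrywise difference P - Q of two equally sized matrices."""
    return [[P[i][j] - Q[i][j] for j in range(len(P[0]))] for i in range(len(P))]


A = [[2.0, 0.5], [0.5, 1.0]]          # stand-in for the population block
A_hat = [[2.05, 0.45], [0.45, 1.02]]  # stand-in for its thresholded estimate

D = inf_norm(inverse_2x2(A))
e = inf_norm(subtract(A_hat, A))
lhs = inf_norm(subtract(inverse_2x2(A_hat), inverse_2x2(A)))
rhs = D * D * e / (1.0 - D * e)       # meaningful only when D * e < 1
```

The bound is informative only while $De$ stays below 1, which is why a rate condition on the estimation error is needed before (7.42) can be used.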



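For (3.21), we decompose $\hat\Sigma^{\nu}_{CS}(\hat\Sigma^{\nu}_{SS})^{-1}$ into 
three terms as follows: 

$$\hat\Sigma^{\nu}_{CS}(\hat\Sigma^{\nu}_{SS})^{-1} = \hat\Sigma^{\nu}_{CS}\left[(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right] + \left(\hat\Sigma^{\nu}_{CS} - \Sigma_{CS}\right)(\Sigma_{SS})^{-1} + \Sigma_{CS}(\Sigma_{SS})^{-1} = I + II + III.$$

By condition (3.17), $\|III\|_\infty \le 1 - \epsilon$. Therefore, it is enough to 
show that $\|I\|_\infty + \|II\|_\infty \le \epsilon/2$ with probability going 
to 1. 

For $\|II\|_\infty$, we have 

$$\|II\|_\infty \le \left\|\hat\Sigma^{\nu}_{CS} - \Sigma_{CS}\right\|_\infty \left\|(\Sigma_{SS})^{-1}\right\|_\infty, \tag{7.44}$$

and, when choosing $\nu = C\sqrt{\log(s(p-s))}/\sqrt{n}$, 

$$\left\|\hat\Sigma^{\nu}_{CS} - \Sigma_{CS}\right\|_\infty \le O_p\left(d^*_{CS}\sqrt{\log(s(p-s))}/\sqrt{n}\right) \tag{7.45}$$

by Lemma 7.2. For $\|I\|_\infty$, we have 

$$\|I\|_\infty \le \left\|\hat\Sigma^{\nu}_{CS}\right\|_\infty \left\|(\hat\Sigma^{\nu}_{SS})^{-1} - (\Sigma_{SS})^{-1}\right\|_\infty \le O_p\left(\bar{D}^{2} d^*_{SS} d^*_{CS}\sqrt{\log(s(p-s))}/\sqrt{n}\right) \tag{7.46}$$

by the preceding bounds. 

In summary, we have $P(\|I\|_\infty + \|II\|_\infty \le \epsilon/2) \to 1$ under 
the condition (3.19). This completes the proof. □ 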

7.5. Proof of Theorem [Q] 

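First, we consider $\|\frac{1}{n}X_C^{T}\epsilon\|_\infty$ and 
$\|\frac{1}{n}X_S^{T}\epsilon\|_\infty$, which appear in (3.11) and (3.12), 
respectively. Since $\epsilon \sim N(0, \sigma^{2})$, then, when $X$ is fixed, 
by standard results on the extreme value of multivariate normal, we have 

$$\left\|\frac{1}{n}X_C^{T}\epsilon\right\|_\infty = O_p\left(\sigma\sqrt{2\left(\max_j \hat\sigma_{jj}\right)\log(p-s)}/\sqrt{n}\right), \tag{7.47}$$

$$\left\|\frac{1}{n}X_S^{T}\epsilon\right\|_\infty = O_p\left(\sigma\sqrt{2\left(\max_j \hat\sigma_{jj}\right)\log s}/\sqrt{n}\right). \tag{7.48}$$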
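The extreme-value step can be sketched with a union bound and the Gaussian tail inequality. The symbols $m$, $u$, and $K$ below are notation introduced for this sketch, with $m = p - s$ for the block indexed by $C$ and $m = s$ for the block indexed by $S$.

```latex
% Conditional on X, each coordinate is Gaussian:
%   (1/n) x_j^T \epsilon ~ N(0, \sigma^2 \hat\sigma_{jj} / n),
%   where \hat\sigma_{jj} = (1/n) \sum_k X_{kj}^2.
% The tail bound P(|N(0, \tau^2)| > u) <= 2 exp(-u^2 / (2 \tau^2))
% and a union bound over m coordinates give
\[
P\Big(\max_{j}\Big|\frac{1}{n}x_j^{T}\epsilon\Big| > u\Big)
  \le 2m\exp\Big(-\frac{n u^{2}}{2\sigma^{2}\max_j\hat\sigma_{jj}}\Big).
\]
% Taking u = K \sigma \sqrt{2 (\max_j \hat\sigma_{jj}) \log m} / \sqrt{n}:
\[
P\Big(\max_{j}\Big|\frac{1}{n}x_j^{T}\epsilon\Big| > u\Big)
  \le 2m \cdot m^{-K^{2}} = 2\,m^{\,1-K^{2}}.
\]
```

For $m \ge 2$ the right-hand side can be made arbitrarily small by taking $K$ large, which is the $O_p$ statement in (7.47) and (7.48).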



By Lemma 7.1, 

$$P\left(\max_j \hat\sigma_{jj} > M + t\right) \le \sum_{j=1}^{p} P\left(\hat\sigma_{jj} > M + t\right) \le \sum_{j=1}^{p} P\left(\hat\sigma_{jj} - \sigma_{jj} > t\right) \le p \cdot \exp(-c n t^{2})$$

for $t = O(1)$, and, thus, $P(\max_j \hat\sigma_{jj} < M) \to 1$. Therefore, 

$$\left\|\frac{1}{n}X_C^{T}\epsilon\right\|_\infty = O_p\left(\sqrt{\log(p-s)}/\sqrt{n}\right), \qquad \left\|\frac{1}{n}X_S^{T}\epsilon\right\|_\infty = O_p\left(\sqrt{\log s}/\sqrt{n}\right). \tag{7.49}$$

Now, we sum up the results in Lemmas 3.1, 3.2, 3.3, and (7.49). Under 
conditions (3.13), (3.14), (3.17), (3.19), (3.22), and the choice of 
$\nu = C\sqrt{\log(p-s)}/\sqrt{n}$ for some $C$ and $\lambda_n$ as in (3.23), 
both (3.11) and (3.12) hold with probability going to 1. □ 

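7.6. Outline of the Proof of Theorem 2.1 

To circumvent the problem of having a non-differentiable penalty function, 
we reformulate the optimization problem in (2.3) as the following, 

$$
\begin{aligned}
&\underset{\beta^{+},\,\beta^{-}}{\arg\min}\;\; \frac{1}{2}(\beta^{+} - \beta^{-})^{T}\hat\Sigma_{\nu}(\beta^{+} - \beta^{-}) - (\beta^{+} - \beta^{-})^{T}\left(\frac{1}{n}X^{T}y\right), \\
&\text{s.t. } \beta_j^{-} \ge 0 \;\forall j, \quad \beta_j^{+} \ge 0 \;\forall j, \quad \sum_{j}\left(\beta_j^{+} + \beta_j^{-}\right) \le t.
\end{aligned}
$$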
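The reformulation relies on writing each coefficient as the difference of two nonnegative parts, so that the $\ell_1$ constraint becomes linear. A minimal sketch of this split follows; the helper name `split_parts` and the example vector are hypothetical.

```python
def split_parts(beta):
    """Return (beta_plus, beta_minus) with beta = beta_plus - beta_minus,
    both nonnegative, and at most one of the two nonzero per coordinate."""
    plus = [max(b, 0.0) for b in beta]
    minus = [max(-b, 0.0) for b in beta]
    return plus, minus


beta = [1.5, 0.0, -0.7]
plus, minus = split_parts(beta)
recombined = [p - m for p, m in zip(plus, minus)]
# With this split, the sum of the parts equals the l1 norm of beta.
l1_from_parts = sum(plus) + sum(minus)
l1_direct = sum(abs(b) for b in beta)
```

With this choice of parts the constraint $\sum_j(\beta_j^+ + \beta_j^-) \le t$ coincides with the bound $\|\beta\|_1 \le t$.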

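Consider the Lagrangian primal function for the above formulation, 

$$\frac{1}{2}(\beta^{+} - \beta^{-})^{T}\hat\Sigma_{\nu}(\beta^{+} - \beta^{-}) - (\beta^{+} - \beta^{-})^{T}\left(\frac{1}{n}X^{T}y\right) + \lambda\left[\sum_{j=1}^{p}\left(\beta_j^{+} + \beta_j^{-}\right) - t\right] - \sum_{j=1}^{p}\lambda_j^{+}\beta_j^{+} - \sum_{j=1}^{p}\lambda_j^{-}\beta_j^{-}.$$

Let $\beta = \beta^{+} - \beta^{-}$. We obtain the following first-order 
conditions, 

$$\frac{1}{n}x_j^{T}y - (\hat\Sigma_{\nu})_j^{T}\beta - \lambda + \lambda_j^{+} = 0, \qquad \frac{1}{n}x_j^{T}y - (\hat\Sigma_{\nu})_j^{T}\beta + \lambda - \lambda_j^{-} = 0, \qquad \lambda_j^{+}\beta_j^{+} = 0, \qquad \lambda_j^{-}\beta_j^{-} = 0.$$

These conditions can be verified, as in (Rosset and Zhu 2007), to imply the facts, 

$$\left|\frac{1}{n}x_j^{T}y - (\hat\Sigma_{\nu})_j^{T}\beta\right| < \lambda \;\Rightarrow\; \beta_j = 0 \quad\text{and}\quad \beta_j \ne 0 \;\Rightarrow\; \left|\frac{1}{n}x_j^{T}y - (\hat\Sigma_{\nu})_j^{T}\beta\right| = \lambda.$$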
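These first-order facts can be checked numerically. The sketch below uses plain coordinate descent on the penalized form $\frac{1}{2}\beta^{T}S\beta - \beta^{T}c + \lambda\|\beta\|_1$, where `S` stands in for the thresholded covariance and `c` for $X^{T}y/n$. This is a swapped-in solver chosen for brevity, not the piecewise-linear path algorithm discussed in the paper, and it assumes positive diagonal entries and, for convergence, a positive semidefinite `S`.

```python
def soft(value, lam):
    """Soft-thresholding operator."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def coordinate_descent(S, c, lam, sweeps=200):
    """Minimize 0.5*b'Sb - b'c + lam*||b||_1 one coordinate at a time."""
    p = len(c)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            partial = c[j] - sum(S[j][k] * beta[k] for k in range(p) if k != j)
            beta[j] = soft(partial, lam) / S[j][j]
    return beta


def residual_correlations(S, c, beta):
    """The quantities c_j - (S beta)_j that appear in the first-order facts."""
    p = len(c)
    return [c[j] - sum(S[j][k] * beta[k] for k in range(p)) for j in range(p)]


# Illustrative inputs (assumed values).
S = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
c = [1.0, 0.2, 0.05]
lam = 0.1
beta = coordinate_descent(S, c, lam)
r = residual_correlations(S, c, beta)
```

In this example only the first coefficient is nonzero, its residual correlation equals $\lambda$ in absolute value, and the other two stay strictly below $\lambda$, matching the two implications above.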



When $\hat\Sigma_{\nu}$ is positive semidefinite, first-order conditions are 
enough to provide a global solution, which is unique if all eigenvalues are 
positive. However, when there exist eigenvalues of $\hat\Sigma_{\nu}$ that are 
negative, a second-order condition, in addition to first-order ones, is required 
to guarantee that a point $\beta$ is a local minimum. Assume strict 
complementarity $\beta_j = 0 \Rightarrow \lambda_j^{+} > 0$ and 
$\lambda_j^{-} > 0$, which holds with high probability as regression methods 
rarely yield zero-valued coefficient estimates without penalization. We see that 
$\mathcal{K} = \{z = z^{+} - z^{-} \ne 0 : z_j^{+} = 0 \text{ and } z_j^{-} = 0 \text{ for } \beta_j = 0\}$ 
covers the set of feasible directions in Theorem 6 of McCormick (1976). Let 
$A = \{j : \beta_j \ne 0\}$. By Theorem 6 of McCormick (1976), a solution 
$\beta$ is a local minimum if for every $z \in \mathcal{K}$ 

$$z^{T}(\hat\Sigma_{\nu})z = (z_A)^{T}(\hat\Sigma_{\nu})_A\, z_A > 0.$$

Furthermore, we note that the solution $\beta$ is global if 
$|x_j^{T}y/n| < \lambda$ for all $j \notin A$ in addition to 
$(\hat\Sigma_{\nu})_A$ being positive definite. This follows from facts implied 
by first-order conditions. 
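The second-order condition reduces to positive definiteness of the active block $(\hat\Sigma_{\nu})_A$. One simple way to test this, sketched below, is to attempt a Cholesky factorization. The matrix is a made-up example of a thresholded matrix that is indefinite as a whole but positive definite on a smaller active set; the function names are hypothetical.

```python
import math


def is_positive_definite(M, tol=1e-12):
    """Attempt a Cholesky factorization of the symmetric matrix M;
    it succeeds exactly when M is (numerically) positive definite."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def submatrix(M, idx):
    """Rows and columns of M restricted to the index list idx."""
    return [[M[i][j] for j in idx] for i in idx]


# Banded example with eigenvalues 1 - 0.9*sqrt(2), 1, and 1 + 0.9*sqrt(2);
# the smallest one is negative, so the full matrix is indefinite.
S_nu = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.9], [0.0, 0.9, 1.0]]
full_pd = is_positive_definite(S_nu)
active_pd = is_positive_definite(submatrix(S_nu, [0, 1]))
```

Here the full matrix fails the check while the block for $A = \{0, 1\}$ passes, illustrating why the condition is stated on the active set rather than on the whole matrix.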

The algorithm for computing piecewise-linear solutions for the 
covariance-thresholded lasso is derived by further manipulating the first-order 
conditions as in the proof of Theorem 2 in Rosset and Zhu (2007). 